Abstract


The Hajek-Feldman dichotomy establishes that two Gaussian measures are either mutually 
absolutely continuous with respect to each other (and hence there is a Radon-Nikodym density 
for each measure with respect to the other one) or mutually singular. Unlike the case of finite 
dimensional Gaussian measures, there are non-trivial examples of both situations when dealing 
with Gaussian stochastic processes. This paper provides: 

(a) Explicit expressions for the optimal (Bayes) rule and the minimal classification error probability in several relevant problems of supervised binary classification of mutually absolutely continuous Gaussian processes. The approach relies on some classical results in the theory of Reproducing Kernel Hilbert Spaces (RKHS).

(b) An interpretation, in terms of mutual singularity, for the “near perfect classification” phenomenon described by Delaigle and Hall (2012). We show that the asymptotically optimal rule proposed by these authors can be identified with the sequence of optimal rules for an approximating sequence of classification problems in the absolutely continuous case.

(c) A new model-based method for variable selection in binary classification problems, 
which arises in a very natural way from the explicit knowledge of the RN-derivatives and the 
underlying RKHS structure. Different classifiers might be used from the selected variables. In 
particular, the classical, linear finite-dimensional Fisher rule turns out to be consistent under 
some standard conditions on the underlying functional model. 


Keywords: absolutely continuous processes, Radon-Nikodym derivatives, singular processes, supervised functional classification, variable selection.

AMS 2010 subject classifications: Primary 62H30; secondary 62G99. 


1 Introduction 


In the booming field of statistics with functional data [see Cuevas (2014) for a recent survey], the computational and numerical aspects, as well as the real data applications, have had (understandably) a major role so far. However, the underlying probabilistic theory, connecting the models which generate the data (i.e., the stochastic processes) with the statistical functional methods, is far less developed. The present work is an attempt to contribute to that connection. Our conclusions will present both theoretical and practical aspects. Roughly speaking, our aim is to prove that in the field of supervised functional classification, there are many useful underlying models (defined in terms of appropriate stochastic processes) for which the expression of the optimal rule can be explicitly given. This will also lead to a natural procedure for variable selection in these models. We are also able to shed some light on the interesting phenomenon of “near perfect classification”, discussed by Delaigle and Hall (2012). This phenomenon does not appear (except for trivial or artificial cases) in the classical finite-dimensional classification theory.

1.1 The framework: supervised classification and absolute continuity 

We are concerned here with the problem of binary functional supervised classification. Throughout the paper $X = X(t) = X_t = X(t,\omega)$ will denote a stochastic process with $t \in I$, for some compact interval $I$. Unless otherwise specified we will assume $I = [0,T]$, with $T > 0$. This process can be observed in two populations identified by the random “label” variable $Y$; the conditional distributions of $X|Y = i$ for $i = 0,1$, denoted by $P_i$, are assumed to be Gaussian.

As usual in the supervised classification setting, the aim is to classify an “unlabelled” observation $X$ according to whether it comes from $P_0$ or from $P_1$. A classification rule is just a measurable function $g: \mathcal{X} \to \{0,1\}$, where $\mathcal{X}$ is the space of trajectories of the process $X$.

The expression $P_1 \ll P_0$ indicates that $P_1$ is absolutely continuous with respect to $P_0$ (i.e. $P_0(A) = 0$ entails $P_1(A) = 0$). Note that, from the Hajek-Feldman dichotomy for Gaussian measures (Feldman, 1958), $P_1 \ll P_0$ implies also $P_0 \ll P_1$, so that both measures are in fact mutually absolutely continuous (or “equivalent”). This is often denoted $P_1 \sim P_0$.

When $P_0$ and $P_1$ are completely known in advance and $P_1 \ll P_0$, it can be shown that the optimal classification rule (often called Bayes rule) is

$$g^*(x) = \mathbb{I}_{\{\eta(x) > 1/2\}} = \mathbb{I}_{\left\{\frac{dP_1}{dP_0}(x) > \frac{1-p}{p}\right\}}, \tag{1}$$

where $\mathbb{I}$ denotes the indicator function, $\eta(x) = P(Y = 1|X = x) = E(Y|X = x)$, $p = P(Y = 1)$ and $\frac{dP_1}{dP_0}$ is the Radon-Nikodym derivative of $P_1$ with respect to $P_0$. The corresponding minimal “classification error” (i.e., the misclassification probability) $L^* = P(g^*(X) \neq Y)$ is called Bayes error; see, e.g., Devroye et al. (1996) for general background and Baíllo et al. (2011a) for additional details on the functional case.

If the Radon-Nikodym derivative $dP_1/dP_0$ is explicitly known, there is not much else to be said. However, in practice, this is not usually the case. Even if the general expression of $dP_1/dP_0$ is known, it typically depends on the covariance $K(s,t) = \mathrm{Cov}(X(s), X(t))$ and mean functions $m_i(t) = E(X(t)|Y = i)$.

The term “supervised” accounts for the fact that, in any case, a data set of “well-classified” independent observations $\mathcal{D}_n = ((X_1, Y_1),\ldots,(X_n, Y_n))$ from $(X, Y)$ is assumed to be available beforehand. So, the classification rules are in fact constructed in terms of the sample data $\mathcal{D}_n$. Throughout the paper, the functional data $X = X(t)$ are supposed to be “densely observed”; see, e.g., Cuevas (2014, Sec. 2.1). A common strategy is to use these data to estimate the optimal rule (1). This is the so-called plug-in approach. It is often implemented in a non-parametric way (e.g., estimating $\eta(x)$ by a nearest-neighbour estimator) which does not require much information on the precise structure of $\eta(x)$ or $dP_1/dP_0$. However, in some other cases we have quite precise information on the structure of $dP_1/dP_0$, so that we can take advantage of this information to get better plug-in estimators of $g^*(x)$.


1.2 Some special characteristics of classification with functional data. The aims of this work

It can be seen from the above paragraphs that the supervised classification problem can be stated, with almost no formal difference, either in the ordinary finite-dimensional situation (where $X$ takes values on the Euclidean space $\mathcal{X} = \mathbb{R}^d$) or in the functional case (where $X$ is a stochastic process). In spite of these formal analogies, the passage to an infinite-dimensional (functional) sample space $\mathcal{X}$ entails some very important challenges. For example, the classical Fisher linear rule, which is still very popular in the finite-dimensional setting, cannot be easily adapted to the functional case (see Baíllo et al. (2011b) for more details and references). However, we are more concerned here with another crucial difference, namely the lack of a natural “dominant” measure in functional spaces, playing a similar role to that of Lebesgue measure in $\mathbb{R}^d$. If we are working with Gaussian measures in $\mathbb{R}^d$, the optimal rule (1) can be established (using the chain rule for Radon-Nikodym derivatives) in terms of the ordinary (Lebesgue) densities of $P_0$ and $P_1$. In the functional case, we are forced to work with the “mutual” Radon-Nikodym derivatives $dP_1/dP_0$, provided that $P_1 \ll P_0$. Usually these derivatives are not easy to calculate or to work with. However, in some important examples they are explicitly known and reasonably easy to handle.

So first, we give and interpret explicit expressions for the optimal (Bayes) classification rule in some relevant cases with $P_1 \ll P_0$. Similar ideas are developed in Baíllo et al. (2011a) and Cadre (2013) but, unlike these references, our approach here relies heavily on the theory of Reproducing Kernel Hilbert Spaces (RKHS). See the corresponding sections below.

In the second place, we consider the mutually singular case $P_1 \perp P_0$, i.e., when there exists a Borel set $A$ such that $P_0(A) = 1$ and $P_1(A) = 0$. Note that this mutually singular (or “orthogonal”) case is rarely found in the finite-dimensional classification setting, except in a few trivial or artificial cases. However, in the functional setting (that is, when $P_1$ and $P_0$ are distributions of stochastic processes) the singular case is an important, very common situation. As we argue in Section 4, this mutual singularity notion is behind the near perfect classification phenomenon described in Delaigle and Hall (2012); see also Cuesta-Albertos and Dutta (2016). The point is to look at this phenomenon from a slightly different (coordinate free) RKHS perspective. We also show that an approximately optimal (“near perfect”) classification rule to discriminate between $P_0$ and $P_1$ when $P_1 \perp P_0$ can be obtained in terms of the optimal rules of a sequence of problems $(P_0, P_1^n)$ with $P_1^n \ll P_0$.

Third, in Section 5 we propose an RKHS-based variable selection mechanism (RK-VS hereafter). Unlike other popular variable selection methods in classification (see, e.g., Berrendero et al. (2016b)), this new proposal allows the user to incorporate, in a flexible way, different amounts of information (or assumptions) on the underlying model. We also provide a closely related linear classifier denoted henceforth by RK-C. As shown in our empirical study, both the variable selection method and the associated classifier perform very well and are clearly competitive compared to several natural alternatives. We also argue, as an important additional advantage, the simplicity and ease of interpretation of the RKHS-based procedures.

All proofs and some details about the simulation models are given in the Supplementary material document.


2 Radon-Nikodym densities for Gaussian processes: some background 

In the following paragraphs we review, for later use, some results regarding the explicit calculation of Radon-Nikodym derivatives of Gaussian processes in the convenient setting provided by the theory of Reproducing Kernel Hilbert Spaces.


2.1 RKHS 


We first need to recall some very basic facts on the RKHS theory; see Berlinet and Thomas-Agnan (2004), Janson (1997, Appendix F) for background.

Given a symmetric positive-semidefinite function $K(s,t)$, defined on $[0,T] \times [0,T]$ (in our case $K$ will be the covariance function of a process), let us define the space $\mathcal{H}_0(K)$ of all real functions which can be expressed as finite linear combinations of type $\sum_i a_i K(\cdot,t_i)$ (i.e., the linear span of all functions $K(\cdot,t)$). In $\mathcal{H}_0(K)$ we consider the inner product $\langle f,g\rangle_K = \sum_{i,j} \alpha_i \beta_j K(s_j,t_i)$, where $f(x) = \sum_i \alpha_i K(x,t_i)$ and $g(x) = \sum_j \beta_j K(x,s_j)$.

Then, the RKHS associated with $K$, $\mathcal{H}(K)$, is defined as the completion of $\mathcal{H}_0(K)$. More precisely, $\mathcal{H}(K)$ is the set of functions $f: [0,T] \to \mathbb{R}$ which can be obtained as the pointwise limit of a Cauchy sequence $\{f_n\}$ of functions in $\mathcal{H}_0(K)$. The theoretical motivation for this definition is the well-known Moore-Aronszajn Theorem (see Berlinet and Thomas-Agnan (2004), p. 19). The functions in $\mathcal{H}(K)$ have the “reproducing property” $f(t) = \langle f, K(\cdot,t)\rangle_K$.

If $\{X_t,\ t \in [0,T]\}$ is an $L^2$-process (i.e. $E(X_t^2) < \infty$, for all $t$) with covariance function $K(s,t)$, the natural Hilbert space associated with this process, $\bar{\mathcal{L}}(X)$, is the closure (in $L^2$) of the linear span $\mathcal{L}(X) = \mathcal{L}(X_t,\ t \in [0,T])$. The so-called Loève Representation Theorem (Berlinet and Thomas-Agnan, 2004, p. 65) establishes that the spaces $\bar{\mathcal{L}}(X)$ and $\mathcal{H}(K)$ are congruent. More precisely, the natural transformation $\Psi(\sum_i a_i X_{t_i}) = \sum_i a_i K(\cdot, t_i)$ defines in fact, when extended by continuity, a congruence (that is, an isomorphism which preserves the inner product) between $\bar{\mathcal{L}}(X)$ and $\mathcal{H}(K)$. Two interesting consequences of Loève's result are: first, if a linear map $\phi$, from $\bar{\mathcal{L}}(X)$ to $\mathcal{H}(K)$, fulfils $E(\phi^{-1}(h) X_t) = h(t)$, for all $h \in \mathcal{H}(K)$, then $\phi$ coincides with the congruence $\Psi$ which maps $X_t$ to $K(t,\cdot)$. Second, $\mathcal{H}(K)$ coincides with the space of functions of the form $h(t) = E(X_t U)$, for some $U \in \bar{\mathcal{L}}(X)$.

Thus, in a very precise way, $\mathcal{H}(K)$ can be seen as the “natural Hilbert space” associated with a process $\{X(t),\ t \in [0,T]\}$. In fact, as we will next see, the space $\mathcal{H}(K)$ is deeply involved in some relevant probabilistic and statistical notions.


2.2 RKHS and Radon-Nikodym derivatives. Parzen’s Theorem 


The following result is a slightly simplified version of Theorem 7A in Parzen (1961); see also Parzen (1962). It will be particularly useful in the rest of this paper.

Theorem 1. (Parzen, 1961, Th. 7A). Let us denote by $P_1$ the distribution of a Gaussian process $\{X(t),\ t \in [0,T]\}$, with continuous trajectories, mean function denoted by $m = m(t) = E(X(t))$ and continuous covariance function denoted by $K(s,t) = \mathrm{Cov}(X(s), X(t))$. Let $P_0$ be the distribution of another Gaussian process with the same covariance function and with mean function identically 0. Then, $P_1 \ll P_0$ if and only if the mean function $m$ belongs to the space $\mathcal{H}(K)$. In this case,

$$\frac{dP_1(X)}{dP_0} = \exp\left\{\langle X, m\rangle_K - \frac{1}{2}\langle m, m\rangle_K\right\}. \tag{2}$$

In the case $m \notin \mathcal{H}(K)$, we have $P_1 \perp P_0$.

Some remarks on this result. 

(a) Note that, except for trivial cases, the trajectories $x$ of the process $X(t)$ are not included, with probability one, in $\mathcal{H}(K)$; see, e.g., Berlinet and Thomas-Agnan (2004, p. 66) and Lukić and Beder (2001) for details. Thus, the expression $\langle X, m\rangle_K$ is defined a.s. as the random variable $\Psi^{-1}(m)$, where $\Psi^{-1}$ is the inverse of the above defined congruence $\Psi: \bar{\mathcal{L}}(X) \to \mathcal{H}(K)$ which maps $X_t$ to $K(t,\cdot)$. This definition of $\langle X, m\rangle_K$ in terms of a congruence is strongly reminiscent of the definition of Itô's stochastic integral.

(b) As a matter of fact, $\langle X, m\rangle_K$ can be seen as a stochastic integral. To see this consider the classical case where $X(t) = B(t)$ is the standard Brownian Motion, $K(s,t) = \min(s,t)$. Then, it can be seen that $\mathcal{H}(K)$ coincides with the so-called Dirichlet space $\mathcal{D}[0,T]$ of those real functions $g$ on $[0,T]$ such that there exists $g'$ almost everywhere in $[0,T]$ with $g' \in L^2[0,T]$, and $g(t) = \int_0^t g'(s)\,ds$. The norm in $\mathcal{D}[0,T]$ is defined by $\|g\|_K = \left(\int_0^T g'(s)^2\,ds\right)^{1/2}$. Likewise, the inverse congruence $\langle X, m\rangle_K$ can also be expressed as the stochastic integral $\int_0^T m'(s)\,dB(s)$.

Thus, Theorem 1 can be seen as an extension of the classical Cameron-Martin Theorem (Mörters and Peres, 2010, p. 24), which is stated for $X(t) = B(t)$. It also coincides with Shepp (1966, Th. 1), when applied to the homoscedastic case in which $P_0$ and $P_1$ are the distributions of $X(t)$ and $m(t) + X(t)$, respectively.

(c) Some additional references on Radon-Nikodym derivatives in function spaces are Varberg (1961, 1964), Kailath (1971) and Segall and Kailath (1975), among others.
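
As a small numerical illustration of remark (b), the following sketch checks, for the Brownian covariance $K(s,t) = \min(s,t)$, that the RKHS norm of a finite combination $m = \sum_i \alpha_i K(\cdot,t_i)$ computed from the inner product of Subsection 2.1 agrees with the Dirichlet norm $\int_0^T m'(s)^2\,ds$. The coefficients, the points and the function names are illustrative assumptions, not taken from the paper.

```python
def rkhs_norm_sq(alphas, ts):
    """||m||_K^2 = sum_i sum_j alpha_i alpha_j min(t_i, t_j) for m = sum_i alpha_i min(., t_i)."""
    return sum(a * b * min(s, t)
               for a, s in zip(alphas, ts)
               for b, t in zip(alphas, ts))


def dirichlet_norm_sq(alphas, ts):
    """Integral of m'(s)^2 over [0, max(ts)].

    m'(s) = sum of alpha_i over the indices with s < t_i, so m' is piecewise
    constant between consecutive sorted points and the integral is exact.
    """
    knots = [0.0] + sorted(set(ts))
    total = 0.0
    for left, right in zip(knots[:-1], knots[1:]):
        slope = sum(a for a, t in zip(alphas, ts) if t >= right)
        total += slope ** 2 * (right - left)
    return total


# Hypothetical example values (any coefficients and distinct points in (0, T] would do).
alphas = [1.0, -2.0, 0.5]
ts = [0.2, 0.5, 0.9]
norm_from_kernel = rkhs_norm_sq(alphas, ts)
norm_from_derivative = dirichlet_norm_sq(alphas, ts)
print(norm_from_kernel, norm_from_derivative)
```

Both computations give the same value because $\int_0^T \mathbb{I}_{\{s < t_i\}}\mathbb{I}_{\{s < t_j\}}\,ds = \min(t_i,t_j)$.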

3 Classification of absolutely continuous Gaussian processes


In this section we consider the supervised classification problem, as stated in Subsection 1.1, under the following general model

$$\begin{cases} P_0: & m_0(t) + \varepsilon_0(t) \\ P_1: & m_1(t) + \varepsilon_1(t) \end{cases} \tag{3}$$

where, for $i = 0,1$, $\{\varepsilon_i(t),\ t \in I\}$ are “noise processes” with mean 0 and continuous trajectories, and $m_i(t)$, $t \in I$, are some continuous functions defining the respective “trends” of $P_0$ and $P_1$. We will take $I = [0,T]$ unless otherwise stated.

The following result provides the expression of the Bayes (optimal) rule and the corresponding minimal error probability for this case, under the usual assumption of homoscedasticity. While the proof is a simple consequence of Theorem 1 and Theorem 1 in Baíllo et al. (2011a), this result will be essential in the rest of the paper.


Theorem 2. In the classification problem under the model (3) assume

(a) the noise processes $\varepsilon_i$ are both Gaussian with continuous trajectories and common continuous covariance function $K(s,t)$.

(b) $m := m_1 - m_0 \in \mathcal{H}(K)$, where $\mathcal{H}(K)$ denotes the RKHS associated with $K$.

Then, the optimal Bayes rule is given by $g^*(X) = \mathbb{I}_{\{\eta^*(X) > 0\}}$, where

$$\eta^*(X) = \langle X - m_0, m\rangle_K - \frac{1}{2}\|m\|_K^2 - \log\left(\frac{1-p}{p}\right) \tag{4}$$

and $\|\cdot\|_K$ denotes the norm in the space $\mathcal{H}(K)$.

Also, the corresponding optimal classification error $L^* = P(g^*(X) \neq Y)$ is

$$L^* = (1-p)\,\Phi\left(-\frac{\|m\|_K}{2} - \frac{1}{\|m\|_K}\log\left(\frac{1-p}{p}\right)\right) + p\,\Phi\left(-\frac{\|m\|_K}{2} + \frac{1}{\|m\|_K}\log\left(\frac{1-p}{p}\right)\right),$$

where $\Phi$ is the cumulative distribution function of a standard normal random variable. When $p = 1/2$, we have $L^* = 1 - \Phi\left(\frac{\|m\|_K}{2}\right)$.
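
To make the dependence of the Bayes error on $\|m\|_K$ and $p$ concrete, here is a minimal sketch that evaluates the expression for $L^*$ in Theorem 2. The function names and the numerical values are illustrative assumptions; only the formula itself comes from the theorem.

```python
import math


def std_normal_cdf(z):
    """CDF of the standard normal distribution, written via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def bayes_error(norm_m, p=0.5):
    """Bayes error L* of Theorem 2 for ||m||_K = norm_m > 0 and prior p = P(Y = 1)."""
    c = math.log((1.0 - p) / p)
    return ((1.0 - p) * std_normal_cdf(-norm_m / 2.0 - c / norm_m)
            + p * std_normal_cdf(-norm_m / 2.0 + c / norm_m))


# Hypothetical values of ||m||_K, chosen only for illustration.
for norm_m in (0.5, 1.0, 2.0, 4.0):
    print(norm_m, bayes_error(norm_m), bayes_error(norm_m, p=0.3))
```

For $p = 1/2$ the logarithmic term vanishes and the function reduces to $1 - \Phi(\|m\|_K/2)$, which decreases to zero as $\|m\|_K$ grows.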

If we compare this result with the optimal rule given for a similar problem in Theorem 1 of the paper Delaigle and Hall (2012), we see that (4) does not explicitly depend on the eigenvalues and eigenvectors of the covariance operator. As a counterpart, the general expression (4) is given in terms of the “stochastic integral” $\langle X, m\rangle_K$. We will comment on this in more detail in the next section.


4 Classification of Gaussian processes: another look at the “near perfect classification” phenomenon

The starting point in this section is again the classification problem between the Gaussian processes $P_0$ and $P_1$ defined in (3), where $\varepsilon_0$ and $\varepsilon_1$ are identically distributed according to the Gaussian process $\varepsilon(t)$ with covariance function $K(s,t) = E(\varepsilon(s)\varepsilon(t))$. The mean functions are $m_0(t) = 0$ and $m_1(t) = \sum_{j=1}^\infty \mu_j \phi_j(t)$, where the $\phi_j$ are the eigenfunctions of the Karhunen-Loève expansion of $K$, that is $K(s,t) = \sum_{j=1}^\infty \theta_j \phi_j(s)\phi_j(t)$.

Let us assume for simplicity that the prior probability is $P(Y = 1) = 1/2$. This model has been considered by Delaigle and Hall (2012). In short, these authors provide the explicit expression of the optimal rule under the assumption $\sum_{j \geq 1} \theta_j^{-2}\mu_j^2 < \infty$. In addition, they find that, when $\sum_{j \geq 1} \theta_j^{-1}\mu_j^2 = \infty$, the classification is “near perfect” in the sense that one may construct a rule with an arbitrarily small classification error. To be more specific, the classification rule they propose is the so-called “centroid classifier”, $T_n$, defined by $T_n(X) = 1$ if and only if $D^2(X,\bar X_1) - D^2(X,\bar X_0) < 0$, where $\bar X_0$, $\bar X_1$ denote the sample means of the training data from $P_0$ and $P_1$ and $D(X,\bar X_j) = |\langle X,\psi\rangle_{L^2} - \langle \bar X_j,\psi\rangle_{L^2}|$, with $\langle X,\psi\rangle_{L^2} = \int_0^T X(t)\psi(t)\,dt$ and $\psi(t) = \sum_{j=1}^\infty \theta_j^{-1}\mu_j\phi_j(t)$. Of course, this requires $\psi \in L^2$ which (from Parseval's identity) amounts to $\sum_{j \geq 1} \theta_j^{-2}\mu_j^2 < \infty$. Then, the asymptotic version of the classifier $T_n$ under the assumed model is

$$T^0(X) = 1, \text{ if and only if } \left(\langle X,\psi\rangle_{L^2} - \langle m_1,\psi\rangle_{L^2}\right)^2 - \langle X,\psi\rangle_{L^2}^2 < 0. \tag{5}$$


Now, a more precise summary of the above discussion is as follows. 


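Theorem 3. (Delaigle and Hall, 2012, Th. 1). Let us consider the binary classification problem (3) under the Gaussian homoscedastic model with $m_0(t) = 0$ and continuous $K$.

(a) If $\sum_{j \geq 1} \theta_j^{-1}\mu_j^2 < \infty$, the minimal (Bayes) misclassification probability is given by $\mathrm{err}_0 = 1 - \Phi\left(\frac{1}{2}\left(\sum_{j \geq 1} \theta_j^{-1}\mu_j^2\right)^{1/2}\right)$. Moreover, under the extra assumption $\sum_{j \geq 1} \theta_j^{-2}\mu_j^2 < \infty$, the optimal classifier (that achieves this error) is the rule $T^0$ defined in (5).

(b) If $\sum_{j \geq 1} \theta_j^{-1}\mu_j^2 = \infty$, the minimal misclassification probability is $\mathrm{err}_0 = 0$ and it is achieved, in the limit, by a sequence of classifiers constructed from $T^0$ by replacing the function $\psi$ with $\psi^{(r)} = \sum_{j=1}^r \theta_j^{-1}\mu_j\phi_j$, with $r \uparrow \infty$.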


As pointed out in Delaigle and Hall (2012), “We argue that those [functional classification] problems have unusual, and fascinating, properties that set them apart from their finite dimensional counterparts. In particular we show that, in many quite standard settings, the performance of simple [linear] classifiers constructed from training samples becomes perfect as the sizes of those samples diverge [...]. That property never holds for finite dimensional data, except in pathological cases.”

Our purpose here is to show that the setup of Theorem 3 (that is, Theorem 1 in Delaigle and Hall (2012)) can be analysed from the point of view of RKHS theory. We do this in Theorems 4 and 5 below.


Theorem 4. In the framework of the classification problem considered in Theorem 3, with continuous trajectories and continuous common covariance function $K$, we have

(a) $\sum_{j \geq 1} \theta_j^{-1}\mu_j^2 < \infty$ if and only if $P_1 \sim P_0$. In that case, the Bayes rule $g^*$ is

$$g^*(X) = 1 \text{ if and only if } \langle X, m\rangle_K - \frac{\|m\|_K^2}{2} > 0, \tag{6}$$

with the notation introduced above. The corresponding optimal (Bayes) classification error is $L^* = 1 - \Phi(\|m\|_K/2)$. Under the additional condition $\sum_{j \geq 1} \theta_j^{-2}\mu_j^2 < \infty$, the optimal rule given in Theorem 3 (a) provides an alternative expression of (6) based on the “coordinates” $\theta_j$ and $\mu_j$.

(b) $\sum_{j \geq 1} \theta_j^{-1}\mu_j^2 = \infty$ if and only if $P_1 \perp P_0$. In this case the Bayes error is $L^* = 0$.

We next make explicit the meaning of the near perfect classification phenomenon. The next theorem establishes that in the singular case (where the Bayes error is zero) we can construct a classification rule whose misclassification probability is arbitrarily small.


Theorem 5. Let us consider the singular case analyzed in Theorem 4. Then, there is a sequence of approximating classification problems, of type $P_{0n}$ vs. $P_{1n}$, corresponding to the absolutely continuous case $P_{0n} \sim P_{1n}$, such that $P_{in}$ converges weakly to $P_i$, for $i = 0,1$, as $n \to \infty$, and the misclassification probabilities of the respective optimal rules (which are explicitly known) tend to zero.
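
The behaviour described in Theorems 4 and 5 can be illustrated numerically. The sketch below assumes one natural truncation, $m^{(r)} = \sum_{j \leq r} \mu_j\phi_j$, for which $\|m^{(r)}\|_K^2 = \sum_{j \leq r} \theta_j^{-1}\mu_j^2$ and the Bayes error (with $p = 1/2$) is $1 - \Phi(\|m^{(r)}\|_K/2)$; this is not necessarily the approximating sequence used in the proof of Theorem 5. The sequences $\theta_j$, $\mu_j$ and all function names are hypothetical choices made only for the example.

```python
import math


def std_normal_cdf(z):
    """CDF of the standard normal distribution, written via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def truncated_bayes_error(mu, theta, r):
    """Bayes error 1 - Phi(||m_r||_K / 2) when only the first r coordinates of m are kept."""
    norm_sq = sum(mu(j) ** 2 / theta(j) for j in range(1, r + 1))
    return 1.0 - std_normal_cdf(0.5 * math.sqrt(norm_sq))


def theta(j):
    return j ** -2.0        # summable eigenvalues (assumed for illustration)


def mu_singular(j):
    return j ** -1.0        # mu_j^2 / theta_j = 1, so the series diverges


def mu_equivalent(j):
    return j ** -2.0        # mu_j^2 / theta_j = j^(-2), so the series converges


singular_errors = [truncated_bayes_error(mu_singular, theta, r) for r in (1, 4, 16, 64, 400)]
equivalent_errors = [truncated_bayes_error(mu_equivalent, theta, r) for r in (1, 10, 100, 2000)]
print(singular_errors)
print(equivalent_errors)
```

In the divergent case the truncated errors decrease to zero, in line with the singular case of Theorem 4 (b); in the convergent case they stabilise at a strictly positive value, as in Theorem 4 (a).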


Now, we are in a position to comment on the contributions of the above Theorems 4 and 5 from the perspective of Theorem 1 in Delaigle and Hall (2012) (see Theorem 3 above for a slightly simplified version). First, Theorem 4 is, in some sense, analogous to the Delaigle-Hall result. In the absolutely continuous case, Theorem 4 (a) provides a completely general, coordinate-free expression for the Bayes rule. It only requires the condition $\sum_{j \geq 1} \theta_j^{-1}\mu_j^2 < \infty$, which is minimal in the sense that it amounts to $P_0 \sim P_1$. Moreover, under the Delaigle-Hall assumption $\sum_{j \geq 1} \theta_j^{-2}\mu_j^2 < \infty$, such Bayes rule can be expressed in “elementary terms” with no resort to the stochastic integral $\langle X, m\rangle_K$ which appears in (6). This highlights an interesting contribution of Theorem 1 in Delaigle and Hall (2012) which remains “hidden” unless the whole problem is considered from the RKHS point of view.

Theorems 4 (b) and 5 shed some light on the “near-perfect” classification phenomenon in two specific aspects. First, Theorem 4 (b) shows that the Delaigle-Hall condition $\sum_{j \geq 1} \theta_j^{-1}\mu_j^2 = \infty$ has a probabilistic interpretation in terms of mutual singularity of measures. Second, Theorem 5 shows that the classification problem in this singular case can be arbitrarily approximated by a sequence of problems in the absolutely continuous case for which the Bayes rules are explicitly known. This establishes a useful link between the dual cases of singularity and absolute continuity.


5 A model-based proposal for variable selection and classification 

Variable selection methods are quite appealing when classifying functional data since they help reduce noise and remove irrelevant information. Classification performance often improves if we only use the functional data values at carefully selected points, instead of employing the whole trajectories.

In this section we argue that the RKHS framework offers a natural setting to formalize variable selection problems. The ability of RKHS to deal with these problems is mainly due to the fact that, by the reproducing property, the elementary functions $K(\cdot,t)$ act as a sort of Dirac's deltas. By contrast, the usual $L^2[0,T]$ space lacks functions playing a similar role. Thus, we propose an RKHS-based variable selection method which is motivated by the expressions of Radon-Nikodym derivatives and optimal rules we have derived in the previous sections. We will also see that our method for identifying the relevant points has an associated classification rule which is consistent under some simple assumptions.

5.1 The proposed method


We deal here with the functional supervised classification problem under the model (3), assuming that the error processes $\varepsilon_0$ and $\varepsilon_1$ are Gaussian and homoscedastic. If we are willing to use a variable selection methodology, our aim would be to choose suitable, informative enough, points $t_1,\ldots,t_d$ in order to perform the classification task using just the $d$-dimensional marginal $(X(t_1),\ldots,X(t_d))$. Assume in principle that $d$ is fixed. Then, the natural question is: what is the optimal choice $(t_1^*,\ldots,t_d^*)$ for $(t_1,\ldots,t_d)$?

The answer is simple if we note that under the assumed model the conditional distributions $(X(t_1),\ldots,X(t_d))|Y = i$, for $i = 0,1$, are Gaussian and homoscedastic with a common covariance matrix whose $i,j$ entry is $K(t_i,t_j)$. Let us denote by $K_{t_1,\ldots,t_d}$ such covariance matrix. Thus, after variable selection, the classification task based on $(X(t_1),\ldots,X(t_d))$ boils down to a standard $d$-variate discrimination problem between two $d$-variate normal populations. Let us denote by $m_{t_1,\ldots,t_d} := (m_1(t_1),\ldots,m_1(t_d)) - (m_0(t_1),\ldots,m_0(t_d))$ the difference between both mean vectors. It is well-known (Izenman, 2008, p. 244) that the optimal misclassification probability (Bayes error) in such a classification problem is a decreasing function of the Mahalanobis distance between both mean vectors, $m_{t_1,\ldots,t_d}^\top K_{t_1,\ldots,t_d}^{-1} m_{t_1,\ldots,t_d}$, where $u^\top$ denotes the transpose of $u$. As a consequence, the criterion for variable selection follows in a natural way: we should choose $(t_1^*,\ldots,t_d^*)$ maximizing $m_{t_1,\ldots,t_d}^\top K_{t_1,\ldots,t_d}^{-1} m_{t_1,\ldots,t_d}$ over a suitable domain.

The theoretical results in this section hold when we look for the maximum within a compact domain $\Theta \subset [0,T]^d$ such that $K_{t_1,\ldots,t_d}$ is nonsingular for all $(t_1,\ldots,t_d) \in \Theta$ (so that $K_{t_1,\ldots,t_d}^{-1}$ makes sense). For example, given $\delta > 0$, the domain

$$\Theta = \Theta(\delta) = \{(t_1,\ldots,t_d) \in [0,T]^d : t_{(i)} + \delta \leq t_{(i+1)}, \text{ for } i = 0,\ldots,d-1\},$$

where $t_{(i)}$, $i = 1,\ldots,d$, denote the ordered values (with $t_{(0)} := 0$), fulfils the required conditions if the finite-dimensional distributions of the process $X$ are not degenerate. The value of $\delta$ can be chosen as small as desired so that the restriction to $\Theta(\delta)$ is not relevant in practice when we can observe the trajectories at a dense enough sample of points.

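Denote $\psi(t_1,\ldots,t_d) := m_{t_1,\ldots,t_d}^\top K_{t_1,\ldots,t_d}^{-1} m_{t_1,\ldots,t_d}$. The criterion for variable selection is the following: choose $(t_1^*,\ldots,t_d^*) \in \Theta$ such that

$$\psi(t_1^*,\ldots,t_d^*) \geq \psi(t_1,\ldots,t_d), \text{ for all } (t_1,\ldots,t_d) \in \Theta. \tag{7}$$

Since $m$ and $K$ are usually unknown, we propose to replace them by appropriate estimators $\hat m$ and $\hat K$ (more on this below). The criterion we suggest for variable selection in practice is to choose points $(\hat t_1,\ldots,\hat t_d) \in \Theta$ such that $\hat\psi(\hat t_1,\ldots,\hat t_d) \geq \hat\psi(t_1,\ldots,t_d)$ for all $(t_1,\ldots,t_d) \in \Theta$, where

$$\hat\psi(t_1,\ldots,t_d) := \hat m_{t_1,\ldots,t_d}^\top \hat K_{t_1,\ldots,t_d}^{-1} \hat m_{t_1,\ldots,t_d}. \tag{8}$$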


We will denote this variable selection method by RK-VS (RK comes from “reproducing kernel”, in view of the RKHS interpretation we will give in subsection 5.2).
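
The selection criterion can be sketched in a few lines. The code below performs an exhaustive search over the $d$-tuples of a finite grid, which is an assumption made here for illustration (the search strategy is not specified in this section), and it takes the mean difference and the covariance on the grid as given inputs; plugging in sample means and sample covariances instead gives the practical criterion (8). The helper names, the grid and the Brownian example are all hypothetical.

```python
from itertools import combinations


def solve(A, b):
    """Solve the linear system A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][c] * x[c] for c in range(i + 1, n))) / M[i][i]
    return x


def psi(m, K, idx):
    """Mahalanobis-type criterion m_t^T K_t^{-1} m_t for the grid indices idx."""
    m_t = [m[i] for i in idx]
    K_t = [[K[i][j] for j in idx] for i in idx]
    return sum(a * v for a, v in zip(solve(K_t, m_t), m_t))


def rk_vs(m, K, d, min_gap=1):
    """Return (indices, value) maximizing psi over d-tuples at least min_gap grid steps apart."""
    best = None
    for idx in combinations(range(len(m)), d):
        if any(b - a < min_gap for a, b in zip(idx, idx[1:])):
            continue  # discrete analogue of the spacing constraint defining the domain
        val = psi(m, K, idx)
        if best is None or val > best[1]:
            best = (idx, val)
    return best


# Hypothetical example: Brownian covariance on a grid and m(t) = K(t, 0.3) = min(t, 0.3).
grid = [k / 10 for k in range(1, 11)]
m = [min(t, 0.3) for t in grid]
K = [[min(s, t) for t in grid] for s in grid]
best_one = rk_vs(m, K, 1)
best_two = rk_vs(m, K, 2, min_gap=2)
print(best_one, best_two)
```

In this toy example $m = K(\cdot, 0.3)$, so a single point already attains the maximal value $\|m\|_K^2 = K(0.3, 0.3) = 0.3$ and adding a second point does not increase the criterion.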


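On the estimation of $m$ and $K$. In principle (unless some strong parametric assumptions are made), the estimation of $m = m_1 - m_0$ will be done in the simplest way, using the sample means, i.e., $\hat m(t) = \hat m_1(t) - \hat m_0(t)$, where $\hat m_j(t) := n_j^{-1}\sum_{i: Y_i = j} X_i(t)$ for $j = 0,1$, with $n_j$ denoting the number of training observations with label $j$.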

The estimation of $K$ might look like a more delicate issue. It is well-known that in some functional data analysis techniques (including functional linear regression and principal components analysis) there is a need to use smooth estimators of the covariance operator $K$; see, for example, Cuevas (2014, Secs. 5.2 and 7.1). Of course, such smoothed estimators could also be applied here but the underlying (functional) reasons to use them are not present in this case, since in fact we are only concerned with the covariance matrices $K_{t_1,\ldots,t_d}$ of finite dimensional projections $(X(t_1),\ldots,X(t_d))$. Thus, unless otherwise stated, we will estimate $K_{t_1,\ldots,t_d}$ by the natural empirical counterpart $\hat K_{t_1,\ldots,t_d}$ constructed from the sample covariances. This has been the method we have used (with overall good results) in our empirical studies. Again, a natural alternative to such estimators would arise in those cases in which we are assuming a precise parametric model, such as for example a Brownian motion for which $K(s,t) = K(\theta,s,t) = \theta\min(s,t)$ depends on an unknown parameter $\theta$. In such models one could naturally consider parametric estimations of type $K(\hat\theta,s,t)$.

Some further practical issues associated with the use of the RK-VS method will be considered below in Subsection 5.4. Before that, we are going to study the functional interpretation of this methodology.


5.2 An interpretation in functional terms 

Let us focus again on the homoscedastic Gaussian functional classification problem (3) in the absolutely continuous case. According to Theorem 1, $P_0 \sim P_1$ entails $m \in \mathcal{H}(K)$, where $m = m_1 - m_0$. Now assume that

$$m(\cdot) = \sum_{i=1}^d \alpha_i K(\cdot,t_i) \tag{9}$$

for some $d \in \mathbb{N}$, $\alpha_i \in \mathbb{R}$, $t_i \in [0,T]$, $i = 1,\ldots,d$. Note that, for $d$ large enough, a finite linear combination of type (9) would be a good approximation for the true value of $m$. This makes sense since, from the definition of the RKHS space $\mathcal{H}(K)$, the set $\mathcal{H}_0(K)$ of such finite linear combinations is dense in $\mathcal{H}(K)$. So the homoscedastic classification problem (3) with $m \in \mathcal{H}_0(K)$ can be seen as an approximation to the general problem with $m \in \mathcal{H}(K)$.

Let us now recall that, from Theorem 2, the optimal rule to classify a trajectory $x$ between $P_0$ and $P_1$ (with $P_0 \sim P_1$) is $g^*(x) = \mathbb{I}_{\{\eta^*(x) > 0\}}$, where $\eta^*(x)$ is given in Equation (4). If $m$ has the form indicated in (9), the discriminant score $\eta^*(x)$ is given by

$$\begin{aligned}
\eta^*(x) &= \Big\langle x - m_0, \sum_{i=1}^d \alpha_i K(\cdot,t_i)\Big\rangle_K - \frac{1}{2}\Big\|\sum_{i=1}^d \alpha_i K(\cdot,t_i)\Big\|_K^2 - \log\left(\frac{1-p}{p}\right) \\
&= \sum_{i=1}^d \alpha_i\big(x(t_i) - m_0(t_i)\big) - \frac{1}{2}\sum_{i=1}^d\sum_{j=1}^d \alpha_i\alpha_j K(t_i,t_j) - \log\left(\frac{1-p}{p}\right),
\end{aligned}$$

where we have used the reproducing property to obtain the last equality.

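A more familiar expression for the optimal rule is obtained taking into account that (9) implies the following relationship between $\alpha_1,\ldots,\alpha_d$ and $t_1,\ldots,t_d$:

$$(\alpha_1,\ldots,\alpha_d)^\top = K_{t_1,\ldots,t_d}^{-1}\, m_{t_1,\ldots,t_d}. \tag{10}$$

Now, using (10) we can write

$$\eta^*(x) = \sum_{i=1}^d \alpha_i\left(x(t_i) - \frac{m_0(t_i) + m_1(t_i)}{2}\right) - \log\left(\frac{1-p}{p}\right), \tag{11}$$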


which exactly coincides with the discriminant score of the optimal (Bayes) rule for the finite dimensional discrimination problem based on the $d$-dimensional marginals $(X(t_1),\ldots,X(t_d))$. Note also that if $m$ is given by (9) then

$$\|m\|_K^2 = \sum_{i=1}^d\sum_{j=1}^d \alpha_i\alpha_j K(t_i,t_j) = m_{t_1,\ldots,t_d}^\top K_{t_1,\ldots,t_d}^{-1} m_{t_1,\ldots,t_d}.$$

We now summarize the previous discussion in the following statement. 


Proposition 1. Let us consider the functional classification problem of discriminating between the processes $P_0$ and $P_1$ with continuous mean functions and continuous trajectories of type $X(t) := m_i(t) + \varepsilon_i(t)$, $t \in [0,T]$, where the $\varepsilon_i$ are independent Gaussian non-degenerate processes with mean 0 and common continuous covariance function $K(s,t)$. Then,

(a) the $d$-dimensional classification problem of discriminating between $P_0$ and $P_1$ on the sole basis of the projections $(X(t_1),\ldots,X(t_d))$ at given points $t_1,\ldots,t_d$ is equivalent (in the sense of having the same optimal rule and Bayes error) to the functional problem stated in the previous paragraph whenever $m := m_1 - m_0$ has the form $m(\cdot) = \sum_{i=1}^d \alpha_i K(\cdot,t_i)$.

(b) Denote by $K_{t_1,\ldots,t_d}$ the covariance matrix of $(X(t_1),\ldots,X(t_d))$ and let $m_{t_1,\ldots,t_d}$ be the difference between both mean vectors. The Mahalanobis distance between the distributions $(X(t_1),\ldots,X(t_d))|Y = i$ for $i = 0,1$, given by $m_{t_1,\ldots,t_d}^\top K_{t_1,\ldots,t_d}^{-1} m_{t_1,\ldots,t_d}$, coincides with $\|m\|_K^2$, the squared norm of $m$ in the RKHS induced by $K$, provided again that $m(\cdot) = \sum_{i=1}^d \alpha_i K(\cdot,t_i)$.

(c) The optimal choice for $(t_1,\ldots,t_d)$, in the sense of minimizing the classification error, is obtained by maximizing $\|m\|_K^2$ among all functions $m$ in the RKHS space having an expression of type $m(\cdot) = \sum_{i=1}^d \alpha_i K(\cdot,t_i)$.

At this point, one might wonder about the role of the assumption $m(\cdot) = \sum_{i=1}^d \alpha_i K(\cdot,t_i)$. The natural question is: to what extent is such a condition needed in our approach to variable selection? In this respect, it is particularly important to note that the variable selection method defined above still makes sense even if such assumption is not fulfilled; in that case, the method provides (asymptotically) the best choice $(X(t_1^*),\ldots,X(t_d^*))$ of the chosen number $d$ of variables in order to obtain a maximal separation in the Mahalanobis distance for their mean vectors under $P_0$ and $P_1$. Note that, in principle, this idea could be considered without any assumption on the functional model (except, perhaps, homoscedasticity). The contribution of Proposition 1 is just to establish in precise terms the conditions on the functional classification model under which the proposed variable selection procedure will be (asymptotically) optimal; see Theorem 6 below.

5.3 The RK-based classification rule: consistency 

The RK-VS variable selection method described above has an associated classification rule,
which is just the classical Fisher's linear rule for the discrimination problem based on the
RK-VS selected variables $(X(\hat t_1),\dots,X(\hat t_d))$. This classifier will be denoted RK-C.

The following result shows that the estimation procedure in the definition of the RK-C
method works, in the sense that the performance of the classification procedure implemented
with the variables corresponding to the estimated points $\hat t_1,\dots,\hat t_d$ tends, as the sample size
increases, to that achieved with the optimal points $t_1^*,\dots,t_d^*$ defined above. This
is next formalized. Let us consider again our functional supervised classification problem
under the conditions stated in the first paragraph of Proposition 1. Let $L^* = P(g^*(X)\neq Y)$
be the misclassification probability obtained with the RK-C classifier, when both m and K
are known and we use the “ideal” variable selection choice $(X(t_1^*),\dots,X(t_d^*))$. Denote by
$L_n = P(\hat g(X)\neq Y\,|\,X_1,\dots,X_n)$ the misclassification probability of Fisher's rule defined in
terms of $(X(\hat t_1),\dots,X(\hat t_d))$ (see the definitions above). For the sake of simplicity
consider p = 1/2. In this setup we have the following consistency result under fairly general
conditions:

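Theorem 6. Consider the classification problem (with p = 1/2) according to the model
stated in Proposition 1, for $t\in[0,T]$. Denote $\hat m(t) = \hat m_1(t)-\hat m_0(t)$, where $\hat m_j(t)$ is the
sample mean at $t$ of the training trajectories from class $j$,
for $j = 0,1$, and let $\hat K_{t_1,\dots,t_d}$ be the pooled sample covariance matrix, whose $(i,j)$ entry is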

K . iAh]) = E,£(o,i) (it - xAumXrAW - xAtifi . 

Assume, 

(i) $E\|\epsilon_j\|_\infty^2 < \infty$, for $j = 0,1$, where $\|\cdot\|_\infty$ stands for the supremum norm.

(ii) The variable selection method is performed on a compact set $\Theta\subset[0,T]^d$.

(iii) $K_{t_1,\dots,t_d}$ is invertible for all $(t_1,\dots,t_d)\in\Theta$ and its entries are continuous on $\Theta$.

Then, $L_n\to L^*$ a.s., as $n\to\infty$.

Note that when the mean difference has the form $m(\cdot)=\sum_{i=1}^d \alpha_i K(\cdot,t_i)$, and $(t_1,\dots,t_d)\in\Theta$,
from Proposition 1 we have that $L^*$ in Theorem 6 coincides in fact with the Bayes error
in the original functional problem. Also, assumption (ii) entails that $t_1<\dots<t_d$ for all
$(t_1,\dots,t_d)\in\Theta$. Note finally that the same result would still be valid for other estimators of
m and K as long as they are consistent uniformly on $\Theta$ (see the proof of Theorem 6 in the
Supplementary Material document). This will be typically the case when we may assume
that the covariance operator is indexed by (and depends continuously on) a finite-dimensional
parameter $\theta$, so that we only need to estimate $\theta$.
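The RK-C rule is Fisher's linear rule applied to the selected variables. The following is a minimal sketch for the case p = 1/2, assuming trajectories are stored as rows of NumPy arrays discretized on a common grid. The function names `fit_fisher` and `predict_fisher`, the normalization of the pooled covariance and the toy data are our own illustrative choices, not the authors' implementation.

```python
import numpy as np

def fit_fisher(X0, X1, idx):
    """Fit Fisher's linear rule (p = 1/2) on the grid columns `idx`.

    X0, X1: arrays of shape (n0, N) and (n1, N), one discretized trajectory per row.
    idx: list of column indices of the selected variables.
    """
    A0, A1 = X0[:, idx], X1[:, idx]
    mean0, mean1 = A0.mean(axis=0), A1.mean(axis=0)
    R0, R1 = A0 - mean0, A1 - mean1
    # Pooled sample covariance; the n0 + n1 - 2 normalization is one common choice.
    K_hat = (R0.T @ R0 + R1.T @ R1) / (len(A0) + len(A1) - 2)
    w = np.linalg.solve(K_hat, mean1 - mean0)   # K^{-1} (m_1 - m_0)
    threshold = w @ (mean0 + mean1) / 2         # midpoint threshold for equal priors
    return w, threshold

def predict_fisher(X, idx, w, threshold):
    """Assign label 1 when the linear score exceeds the threshold, else 0."""
    return (X[:, idx] @ w > threshold).astype(int)

# Toy usage on synthetic data (not the simulations of the paper).
rng = np.random.default_rng(1)
X0 = rng.normal(size=(200, 5))
X1 = rng.normal(size=(200, 5))
X1[:, 2] += 4.0                                 # only column 2 carries signal
w, threshold = fit_fisher(X0[:100], X1[:100], idx=[2, 4])
pred0 = predict_fisher(X0[100:], [2, 4], w, threshold)
pred1 = predict_fisher(X1[100:], [2, 4], w, threshold)
accuracy = (np.mean(pred0 == 0) + np.mean(pred1 == 1)) / 2
print(accuracy)
```

In the RK-C setting, `idx` would hold the grid indices of the points chosen by RK-VS; for unequal priors the threshold would need an additional $\log((1-p)/p)$ term.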


5.4 Some practical issues and computational aspects 

There are several aspects worthy of attention in the RK-VS and RK-C procedures, as presented
in the previous subsections.

First, the number d of points to be selected is assumed to be finite. This can be seen as a
reasonable approximation since, as mentioned above, the set of all finite linear combinations
$\sum_{i}\alpha_i K(\cdot,t_i)$ is dense in the RKHS $\mathcal{H}(K)$ to which m is assumed to belong. Also, in
many practical situations, the mean function m depends just on a finite number of values
$t_i$. A simple example of this situation is as follows: consider the model of Proposition 1, where $\epsilon_0$ and $\epsilon_1$
are Brownian motions, $m_0 = 0$ and $m_1$ is a continuous, piecewise linear function such that
$m_1(0) = 0$. According to the computations above, the discriminant score of a trajectory x(t)
only depends on the values of x at the points where $m_1$ is not differentiable (and, possibly,
also on x(0) and x(T)). This can be more easily derived from the representation of the
discriminant scores in terms of stochastic integrals (see Subsection 2.2, remark (b)).


Second, the matrix $K_{t_1,\dots,t_d}$ and the prior probability p may not be known either. Thus,
$K_{t_1,\dots,t_d}$ and p might be replaced by suitable consistent estimators $\hat K_{t_1,\dots,t_d}$ and $\hat p$. The
appropriate estimator $\hat K_{t_1,\dots,t_d}$ depends on the assumptions we are willing to make about the
processes involved in the classification problem. For instance, if all we want to assume is
that they are Gaussian, we could use the pooled sample covariance matrix. However, under
a parametric model, only a few parameters should be estimated in order to get $\hat K_{t_1,\dots,t_d}$; see
Subsection 5.5 for more details on this.


Third, $\hat\psi(t_1,\dots,t_d)$ is a non-concave function with potentially many local maxima so that
the maximization process could be hard to implement even for moderately large values of d.
Hence, in practice, we can use the following “greedy” algorithm.

1. Initial step: consider a large enough grid of points in [0,T] and find $\hat t_1$ such that
$\hat\psi(\hat t_1)\ge\hat\psi(t)$ when t ranges over the grid. Observe that this initial step amounts to
finding the point maximizing the signal-to-noise ratio since

$$\hat\psi(t) = \frac{\hat m(t)^2}{\hat\sigma_t^2},$$

for a suitable estimator $\hat\sigma_t^2$ of the variance at t.

2. Repeat until convergence: once we have computed $\hat t_1,\dots,\hat t_{d-1}$, find $\hat t_d$ such that
$\hat\psi(\hat t_1,\dots,\hat t_{d-1},\hat t_d)\ge\hat\psi(\hat t_1,\dots,\hat t_{d-1},t)$ for all t in the grid.

Whereas we have no guarantee that this algorithm converges to the global maximum of
$\hat\psi(t_1,\dots,t_d)$, it is computationally affordable and shows good performance in practice.
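The two steps above can be sketched as follows. This is a minimal illustration under our own assumptions: the criterion is taken to be $\hat m^\top \hat K^{-1}\hat m$ restricted to the selected grid points, $\hat K$ is the pooled sample covariance with one common normalization, the loop runs for a fixed number d of steps rather than "until convergence", and the grid avoids points with zero variance (such as t = 0 for Brownian motion) so that the linear systems are solvable. The names `psi_hat`, `greedy_rkvs` and `brownian` are hypothetical.

```python
import numpy as np

def psi_hat(m_hat, K_hat, idx):
    """Criterion m^T K^{-1} m restricted to the grid indices in `idx`."""
    m = m_hat[idx]
    K = K_hat[np.ix_(idx, idx)]
    return float(m @ np.linalg.solve(K, m))

def greedy_rkvs(X0, X1, d):
    """Greedily pick d grid indices; X0, X1 hold one discretized trajectory per row."""
    mean0, mean1 = X0.mean(axis=0), X1.mean(axis=0)
    m_hat = mean1 - mean0
    R0, R1 = X0 - mean0, X1 - mean1
    # Pooled sample covariance over the whole grid (normalization is an assumption).
    K_hat = (R0.T @ R0 + R1.T @ R1) / (len(X0) + len(X1) - 2)
    selected = []
    for _ in range(d):
        candidates = [j for j in range(X0.shape[1]) if j not in selected]
        # Score each candidate when added to the points already selected.
        scores = [psi_hat(m_hat, K_hat, selected + [j]) for j in candidates]
        selected.append(candidates[int(np.argmax(scores))])
    return selected

# Toy usage: Brownian trajectories whose mean difference is 4 * min(t, 0.5),
# so that a single relevant point, t = 0.5, is expected.
rng = np.random.default_rng(0)
grid = np.linspace(0.01, 1.0, 100)
dt = np.diff(np.concatenate(([0.0], grid)))

def brownian(n):
    """Simulate n Brownian paths on `grid` by cumulating Gaussian increments."""
    return np.cumsum(rng.normal(size=(n, grid.size)) * np.sqrt(dt), axis=1)

X0 = brownian(500)
X1 = brownian(500) + 4.0 * np.minimum(grid, 0.5)
selected = greedy_rkvs(X0, X1, d=2)
print(grid[selected])
```

With d = 1 the criterion reduces to $\hat m(t)^2/\hat K(t,t)$, the signal-to-noise ratio of the initial step. Each further step costs one small linear solve per candidate grid point, which is why the procedure stays cheap for moderate d.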

5.5 An illustrative example. The price of estimating the covariance function

The purpose of this subsection is to gain some practical insight on the meaning and performance
of our RK methods. In particular, we will take into account that the RK methods can
incorporate information on the assumed underlying model, via a known (or partially known)
covariance function. In what follows we will assume that the data trajectories come from a
Brownian motion with different (unknown) mean functions. So we would incorporate this
information in our “variable selection + classification” task by just using the, supposedly
true, $K(s,t)$, instead of its estimator. We will denote by RK$_B$-VS and RK$_B$-C the
resulting “oracle” methods for variable selection and classification, respectively, implemented
with $K(s,t)=\min\{s,t\}$.

Of course, the assumption that K is known is too strong, but still it is useful to compare
the performance of the oracle RK$_B$-VS and RK$_B$-C methods with the standard RK-VS and
RK-C versions in which $K(s,t)$ is estimated from the sample. In particular, we want to
assess the loss of efficiency involved in the estimation of $K(s,t)$. To this end, consider a
simulated example under the general model in which $P_0$ and $P_1$ are Brownian motions
whose mean functions fulfil $m(t) = m_1(t)-m_0(t)=\sum_{k} a_k\Phi_{m,k}(t)$, where $t\in[0,1]$, the $a_k$
are constants and the $\{\Phi_{m,k}\}$ are continuous piecewise linear functions as those considered in
Mörters and Peres (2010, p. 28); they are obtained by integrating the piecewise constant
functions of a Haar basis. Explicit expressions can be found in the Supplementary Material
document. In fact, it is proved there that the $\{\Phi_{m,k}\}$ form an orthonormal basis of the
Dirichlet space $\mathcal{D}[0,1]$ which, as commented above, is the RKHS corresponding to this
model. As a consequence, the equivalence condition $m\in\mathcal{H}(K)$ is automatically fulfilled.
In addition, given the simple structure of the “peak” functions $\Phi_{m,k}$, it is easy to see that the
“sparsity condition” $m(\cdot)=\sum_{i=1}^d \alpha_i K(\cdot,t_i)$ also holds in this case. To be more specific, in our
simulation experiments we have taken $m_0(t)=0$, $m_1(t)=\Phi_{1,1}(t)-\Phi_{2,1}(t)+\Phi_{2,2}(t)-\Phi_{3,2}(t)$,
and $p = P(Y=1)=1/2$, so that the Bayes rule given by Theorem 2 depends only on the
values x(t) at t = 0, 1/4, 3/8, 1/2, 3/4 and 1, and the Bayes error is 0.1587. Some typical
trajectories are shown in Figure S1 in the Supplementary Material document.
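The value 0.1587 can be reproduced from the finite-dimensional marginals. The sketch below rests on assumptions of ours: the explicit form of the peak functions is not given in this section, so `peak` uses a hypothetical normalization in which $\Phi_{m,k}$ is supported on $[(k-1)2^{-(m-1)}, k2^{-(m-1)}]$ with slopes $\pm 2^{(m-1)/2}$ (which makes the derivatives orthonormal Haar functions). The point t = 0 is left out because $X(0)=0$ for a standard Brownian motion, and the Bayes error for p = 1/2 is taken as $1-\Phi(\|m\|_K/2)$.

```python
import math
import numpy as np

def peak(m, k, t):
    """Hypothetical peak function Phi_{m,k}: a tent on a dyadic interval (assumed form)."""
    length = 2.0 ** (-(m - 1))
    left = (k - 1) * length
    mid, right = left + length / 2, left + length
    slope = 2.0 ** ((m - 1) / 2)
    t = np.asarray(t, dtype=float)
    up = slope * (t - left)        # rising branch on [left, mid]
    down = slope * (right - t)     # falling branch on (mid, right]
    return np.where((t >= left) & (t <= mid), up,
                    np.where((t > mid) & (t <= right), down, 0.0))

def m1(t):
    """Mean function of the example: Phi_{1,1} - Phi_{2,1} + Phi_{2,2} - Phi_{3,2}."""
    return peak(1, 1, t) - peak(2, 1, t) + peak(2, 2, t) - peak(3, 2, t)

t = np.array([0.25, 0.375, 0.5, 0.75, 1.0])   # relevant points, t = 0 excluded
K_t = np.minimum.outer(t, t)                  # Brownian covariance min(s, t)
m_t = m1(t)

norm_sq = m_t @ np.linalg.solve(K_t, m_t)     # m^T K^{-1} m, i.e. ||m||_K^2
# 1 - Phi(x) = erfc(x / sqrt(2)) / 2 with x = ||m||_K / 2
bayes_error = 0.5 * math.erfc(math.sqrt(norm_sq) / 2 / math.sqrt(2))
print(norm_sq, bayes_error)
```

Under these assumptions the squared norm is 4 (one unit per orthonormal peak function), which gives a Bayes error of $1-\Phi(1)\approx 0.1587$.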

Now, we analyze the performance of RK and RK$_B$ in this example. The left panel of
Figure 1 shows the evolution of the classification error as the sample size increases for RK-C
(blue line with circles), RK$_B$-C (red line with diamonds), the k-nearest neighbor rule (kNN, gray
line with squares) and the support vector machine classifier with a linear kernel (SVM, orange
line with triangles). The last two rules are applied to the complete trajectories, without any
variable selection. The dashed black line indicates the Bayes error. Each output is obtained
by averaging 100 independent runs with test samples of size 200; for each sample size, the
number of selected variables (RK-C and RK$_B$-C), the number k of neighbours (kNN) and
the cost parameter (SVM) are set through a validation sample. The right panel of Figure 1
shows the classification error in terms of the number of variables for RK-C and RK$_B$-C
for n = 500. Finally, Figure 2 shows the frequency of selection of each variable among
the first six (by construction, we know there are just six relevant points) corresponding to
100 independent runs of RK-VS for three different sample sizes. The theoretical relevant
points are marked by vertical dashed lines. So, to sum up, whereas Figure 1 summarizes the
results in terms of classification performance, Figure 2 is more concerned with the capacity to
identify the right relevant variables.

Figure 1: Evolution of the classification error of RK-C and RK$_B$-C in terms of the sample size (left panel)
and the number of selected variables (right panel).

These results are quite positive; RK-C seems to be a good estimator of the optimal
classifier, as the error rate converges swiftly to the Bayes error even when the number of
variables is unknown and fixed by validation. Observe that the convergence seems to be
slower for other standard classifiers such as kNN and SVM (Figure 1, left plot). The right
plot in Figure 1 shows that for the true number of variables (six) the algorithm achieves
the best performance. By contrast, a wrong choice of the number of variables can entail an
important increase of the misclassification rate, so this is a sensitive issue. In addition, the
selected variables (represented in Figure 2) are mostly in coincidence with the theoretical
ones. Even for small sample sizes, RK$_B$-VS and RK-VS variables are grouped around the
relevant variables. Only the variable X(0) is omitted since it is in fact nearly irrelevant. This
good performance in detecting the important variables is in principle better than one might
expect for a greedy algorithm (which, therefore, might not provide the true global optimum).
Note also that the inclusion of some additional information seems especially beneficial for
smaller sample sizes. Finally, it is worth mentioning that the RK-based methods seem to be
relatively inexpensive from the computational point of view. For example, the increase in the
computation time as the sample size increases is much slower than that of other competing
methods. See Figure S2 in the Supplementary Material document.

Figure 2: Histograms of the six first selected variables by RK-VS over 100 runs for sample sizes 50 (top
panel), 200 (middle panel) and 1000 (bottom panel).

6 Experiments

Our purpose in Section 5 was twofold: we proposed both a variable selection method and an
associated classifier. We check here the corresponding performances.


6.1 Simulation study 

The simulation experiments include 94 models, previously considered in the studies by
Berrendero et al. (2016a,b). These models can be grouped into three classes.

(i) Gaussian models: they are defined via the marginal Gaussian distributions (Brownian-like,
Ornstein-Uhlenbeck, ...) $P_i$ of $X(t)|Y=i$ for $i=0,1$. In all cases $p=P(Y=1)=1/2$.

(ii) Logistic-type models: they are defined through the function $\eta(X)=P(Y=1|X)$
and the marginal of X. It is assumed that $\eta(x)=(1+e^{-\Psi(x)})^{-1}$ with different
choices for the link function $\Psi$.

(iii) Finite mixtures of different types of Gaussian models.

Detailed descriptions of the 94 considered models can be found in the Supplementary
Material document. We should emphasize that only 7 among these 94 models fulfil all the
conditions imposed in our theoretical results. They are grouped under the label RKHS in the
output tables of the Supplementary Material document. The remaining “unorthodox” models
aim at checking the behavior of our proposal when some departures from the assumptions
are present.

Training samples of sizes n = 30, 50, 100, 200 are considered for each model. Sample
trajectories are discretized in 100 equispaced points in the interval [0,1]. The criterion of
comparison is the classification accuracy for an independent test sample of size 200. The
number of selected variables as well as the classification parameters (if needed) are fixed in
a validation step, using, for each test sample, another independent validation sample of size
200. The final output is the average classification accuracy over 200 runs of this experiment.

Comparison of variable selection methods

The primary aim of the study is to check the performance of our RK variable selection
method against other dimension reduction procedures, chosen among the winners in
Berrendero et al. (2016a,b). To be specific, these are the methods considered in the experiments:

RK-VS, as defined above.

RK$_B$-VS, the “oracle” version of RK-VS defined in Subsection 5.5 by assuming that the
common covariance structure coincides with that of the Brownian motion. Since this
is not in general a realistic assumption, RK$_B$ is included only for illustration purposes,
just to check the price of the estimation of $K(s,t)$ and the (sometimes surprising)
resistance against the assumptions on the covariance structure.

mRMR-RD: this is a modified version of the popular minimum redundancy maximum
relevance algorithm (mRMR) for variable selection proposed by Ding and Peng (2005).
The aim of mRMR is to select the subset S of variables that maximizes the difference
rel(S) - red(S), where rel(·) and red(·) are appropriate measures of relevance and
redundancy which are defined in terms of an association measure between random
variables. The improved version of mRMR considered here (denoted mRMR-RD)
has been recently proposed in Berrendero et al. (2016b). It relies on the use of the
increasingly popular distance correlation (Székely et al., 2007) association measure to
define relevance and redundancy in the mRMR algorithm.

MHR: the maxima hunting method (Berrendero et al., 2016a) also uses the distance
correlation $\mathcal{R}^2(t)=\mathcal{R}^2(X(t),Y)$ between X(t) and the binary response Y to select the
points $t_1,\dots,t_k$ corresponding to the local maxima of $\mathcal{R}^2(t)$. This automatically takes
into account the relevance-redundancy trade-off (though in a qualitative way, quite
different to that of the mRMR methodology).

PLS: partial least squares, a well-known dimension reduction technique; see e.g.
Delaigle and Hall (2012) and references therein.

Table 1: Percentage of correct classification with the three considered classifiers

Classifier   Sample size   Dimension reduction methods
                           mRMR-RD   PLS       MHR       RK-VS     RK_B-VS
LDA          n = 30        81.04     [82.87]   82.44     81.50     80.89
             n = 50        82.37     [83.78]   83.68     83.44     82.54
             n = 100       83.79     84.70     84.97     [85.30]   84.46
             n = 200       84.88     85.46     85.90     [86.51]   85.90
kNN          n = 30        81.88     82.45     [82.46]   82.28     81.92
             n = 50        82.95     83.49     83.43     [83.75]   83.25
             n = 100       84.31     84.77     84.73     [85.59]   84.95
             n = 200       85.38     85.79     85.91     [87.16]   86.50
SVM          n = 30        83.22     84.12     [84.62]   84.28     84.12
             n = 50        84.21     85.04     85.44     [85.60]   85.20
             n = 100       85.27     86.03     86.29     [86.96]   86.48
             n = 200       86.10     86.79     86.86     [87.90]   87.50


All these methods for variable selection (or, in the case of PLS, for projection-based
dimension reduction) are data-driven, i.e., independent of the classifier, so we can combine
them with different classifiers. For illustrative purposes we show the results we have obtained
with the Fisher linear classifier (LDA), k nearest neighbors (kNN) and support vector
machine with a linear kernel (SVM).

Some aggregated results are in Table 1. Variable selection methods and PLS are in
columns and each row corresponds to a sample size and a classifier. Each output is the
average classification accuracy of the 94 models over 200 runs. Boxed outputs denote the best
result for each sample size and classifier. The full results of the 1128 experiments (94 models
× 4 sample sizes × 3 classifiers) are available in the supplementary file of outputs. Additional,
more detailed, summary tables are included in the Supplementary Material document.

The results are quite similar for all considered classifiers: the RK-VS methodology outperforms
the other competitors on average, with a better performance for bigger sample sizes.
Although RK-VS could have more difficulty estimating the covariance matrix for small
sample sizes, it is very close to MHR, which seems to be the winner in that case. Besides,
the number of variables selected by RK-VS (not reported here for the sake of brevity; see
Table S4 in the Supplementary Material) is comparable to that of mRMR-RD and MHR
for kNN and SVM but it is about half of the number selected by mRMR-RD and MHR for
LDA (the number of PLS components is often smaller but they lack interpretability). Note
that, according to the available experimental evidence (Berrendero et al., 2016a,b), the
competing selected methods (mRMR-RD, MHR and PLS) have themselves a good general
performance. So, the outputs in Table 1 are remarkable and encouraging, especially taking
into account that only 7 out of 94 models under study fulfil all the regularity conditions
required for RK-VS. Note that, somewhat surprisingly, the failure of the “Brownian assumption”
implicit in the RK$_B$-VS method does not entail a big loss of accuracy with respect to
the “non-parametric” RK-VS version.


Comparison of classifiers 

We also assess the performance of the classifiers RK-C and RK$_B$-C; see the definitions in
the first paragraphs of Subsections 5.3 and 5.5, respectively. The competitors are kNN and
SVM (with linear kernel), two standard all-purpose classification methods.

Table 2 provides again average percentages of correct classification over 200 runs of the
previously considered 94 functional models. The results are grouped by sample size (in
rows). Classification methods are in columns. The full detailed outputs are given in the
supplementary file of outputs.


The difference with Table S1 is that, in this case, the classifiers kNN and SVM are used
with no previous variable selection. So, the original whole functional data are used. This
is why we have replaced the standard linear classifier LDA (which cannot be used in
high-dimensional or functional settings) with the LDA-Oracle method, which is just the Fisher
linear classifier based on the “true” relevant variables (which are known beforehand since we
consider models for which the Bayes rule depends only on a finite set of variables). Of course
this classifier is not feasible in practice; it is included here only for comparison purposes.

Table 2: Average classification accuracy (%) over all considered models.

n      kNN     SVM     RK-C    RK_B-C   LDA-Oracle
30     79.61   83.86   81.50   80.89    84.97
50     80.96   85.01   83.44   82.54    86.23
100    82.60   86.20   85.30   84.46    87.18
200    83.99   87.07   86.51   85.90    87.69

Table 3: Average classification accuracy (%) for the models satisfying the assumptions of Theorem 6.

n      kNN     SVM     RK-C    RK_B-C   LDA-Oracle
30     83.20   87.29   88.30   89.95    90.91
50     84.90   88.81   89.81   90.69    91.41
100    86.61   89.88   90.81   91.18    91.64
200    87.94   90.48   91.13   91.30    91.71

As before, RK-C results are better for higher sample sizes and the distances between
SVM or LDA-Oracle and RK-C are swiftly shortened with n; and again, RK$_B$-C is less
accurate than RK-C, but not by much. While the global winner is SVM, the slight loss of
accuracy associated with the use of RK-C and RK$_B$-C can be seen as a reasonable price for
the simplicity and ease of interpretability of these methods. Note also that the associated
procedure of variable selection can be seen as a plus of RK-C. In fact, the combination of
RK-VS with SVM outperforms SVM based on the whole functional data.

Table 3 shows average percentages of correct classification over 200 runs for the subset of
seven models, among all those considered, that satisfy the assumptions in Theorem 6, which establishes
the consistency of the procedure proposed in Section 5. It is not surprising that for these
models RK-C and RK$_B$-C have a better performance than kNN and SVM. In fact the RK
percentages of correct classification are very close to those of LDA-Oracle, which means that
there is not much room for improvement under these assumptions.


6.2 Real data 


We now study the RK-C performance in two real data examples. We have chosen the “easiest”
and the “hardest” data sets (from the classification point of view) of those considered
in Delaigle and Hall (2012). Given the close connections between our theoretical setting and
that of these authors, this partial coincidence of data sets seems pertinent.

Thus, we follow the same methodology as in the cited paper, that is, we divide the data
set randomly into a training sample of size n (n = 30, 50, 100) and a test sample with the
remaining observations. Then, the RK-C classifier is constructed from the training set and
it is used to classify the test data. The misclassification error rate is estimated through 200
runs of the whole process. The number of variables selected by RK-C is fixed by a standard
leave-one-out cross-validation procedure over the training data.

We consider the Wheat and the Phoneme data sets. Wheat data correspond to 100
near infrared spectra of wheat samples measured from 1100nm to 2500nm in 2nm intervals.
Following Delaigle and Hall (2012) we divide the data in two populations according to the
protein content (more or less than 15) and use the derivative curves obtained with splines.
For this wheat data near perfect classification is achieved. Phoneme is a popular data set
in functional data analysis. It consists of log-periodograms obtained from the pronunciation
of five different phonemes recorded in 256 equispaced points. We consider the usual binary
version of the problem, aimed at classifying the phonemes “aa” (695 curves) and “ao” (1022
curves). This is not an easy problem. As in the reference paper we make the trajectories
continuous with a local linear smoother and remove the noisiest part, keeping the first 50
variables. More details and references on this data can be found in Delaigle and Hall (2012).


Table 4 shows exactly the same results of Table 2 in Delaigle and Hall (2012) plus
an extra column (in boldface) for our RK-C method. Since we have followed the same
methodology, the results are completely comparable, apart from the minimal differences due
to randomness. CENT$_{PC1}$ and CENT$_{PLS}$ stand for the centroid classifier of Delaigle and Hall (2012), where
the function $\psi$ is estimated via principal components or PLS components, respectively. NP
refers to the classifier based on the non-parametric functional regression method proposed by
Ferraty and Vieu (2006) and CENT$_{PCp}$ denotes the usual centroid classifier applied to the
multivariate principal component projections. The outputs correspond to the average (over
200 runs) percentages of misclassification obtained for each method, sample size and data
set. The values in parentheses correspond to the standard deviation of these errors.

Table 4: Misclassification percentages (and standard deviations) for the classification methods considered
in Table 2 of Delaigle and Hall (2012) and the new RK-C method

Data      n     Classification rules
                CENT_PC1      CENT_PLS      NP            CENT_PCp      RK-C
Wheat     30    0.89 (2.49)   0.46 (1.24)   0.49 (1.29)   15.0 (1.25)   0.25 (1.58)
          50    0.22 (1.09)   0.06 (0.63)   0.01 (0.14)   14.4 (5.52)   0.02 (0.28)
Phoneme   30    22.5 (3.59)   24.2 (5.37)   24.4 (5.31)   23.7 (2.37)   22.5 (3.70)
          50    20.8 (2.08)   21.5 (3.02)   21.9 (2.91)   23.4 (1.80)   21.5 (2.36)
          100   20.0 (1.09)   20.1 (1.12)   20.1 (1.37)   23.4 (1.36)   20.1 (1.25)

The results show that the RK-C classifier is clearly competitive against the remaining
methods. In addition, there is perhaps some interpretability advantage in the use of RK-C,
as this method is based on dimension reduction via variable selection so that the “reduced
data” are directly interpretable in terms of the original variables. Let us finally point out
that the variable selection process is quite efficient: in the wheat example, near perfect
classification is achieved using just one variable; in the much harder phoneme example, the
average number of selected variables is three.

7 Conclusions 

We have proposed a RKHS-based method for both variable selection and binary classification.
It is fully theoretically motivated in terms of the RKHS associated with the underlying
model. We next summarize our study of the RK methods in the following conclusions.

a) The identification of the RKHS associated with a supervised classification problem offers
several important theoretical and practical advantages. Apart from providing
explicit expressions of the optimal Bayes rule (via the corresponding Radon-Nikodym
derivatives), the RKHS approach provides a theoretical explanation for the near perfect
classification phenomenon in terms of the mutual singularity of the involved measures.

b) Perhaps more importantly, the RKHS approach provides a theoretical scenario to motivate
the use of variable selection. Under the RKHS framework, the family of models
fulfilling a finite RKHS expansion for m of type $m(\cdot)=\sum_{i=1}^d \alpha_i K(\cdot,t_i)$ is dense in the
whole class of considered models. Note also that, even if a finite expansion is not
exactly fulfilled, the method has a clear interpretation (see the comments after
Proposition 1) as it looks for the “best” choice of $(t_1,\dots,t_d)$ under this approximated model.
The point is that, in any case, the method is always motivated in population terms.

c) The RKHS-based variable selection and classification procedures are quite accurate
and computationally inexpensive, with important advantages in terms of simplicity
and interpretability. The simulation outputs show that the RK-VS procedure is especially
successful as a variable selection method. As a classifier, RK-C is still competitive and
especially good when the underlying assumptions are fulfilled.

d) The empirical results also show a remarkable robustness of the RK methodology against
departures from the assumptions on which it is based.

Acknowledgements. Research partially supported by Spanish grant MTM2013-44045-P.
Supplementary materials. The file Supplementary.pdf includes all the proofs as well as
some additional details, tables and figures on the empirical results. The file Outputs.xlsx
gives the complete simulation outputs.

Supplementary material for the paper “On the use of reproducing
kernel Hilbert spaces in functional classification”


S1 Proofs


Proof of Theorem 2. Equation (4) follows straightforwardly from the combination of (1) and
(2). To prove the expression for the Bayes error notice that $\langle X-m_0,m\rangle_K$ lies in
$\mathcal{L}(X-m_0)$ and therefore the random variable $\eta^*(X)$ is Gaussian both under $Y=1$ and $Y=0$.
Furthermore, Equations (6.19) and (6.20) in Parzen (1961) yield

$$E(\eta^*(X)\,|\,Y=0) = -\frac{\|m\|_K^2}{2} - \log\frac{1-p}{p},$$

$$E(\eta^*(X)\,|\,Y=1) = \frac{\|m\|_K^2}{2} - \log\frac{1-p}{p},$$

$$\mathrm{Var}(\eta^*(X)\,|\,Y=0) = \mathrm{Var}(\eta^*(X)\,|\,Y=1) = \|m\|_K^2.$$

The result follows using these values to standardize the variable $\eta^*(X)$ in
$L^* = (1-p)P(\eta^*(X)>0\,|\,Y=0) + pP(\eta^*(X)<0\,|\,Y=1)$. □
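The standardization step can be made concrete. The sketch below assumes the means $\mp\|m\|_K^2/2-\log((1-p)/p)$ and the common variance $\|m\|_K^2$ of $\eta^*(X)$ under $Y=0$ and $Y=1$, which lead to $L^*=(1-p)\,\Phi\big(-\tfrac{\|m\|_K}{2}-\tfrac{1}{\|m\|_K}\log\tfrac{1-p}{p}\big)+p\,\Phi\big(-\tfrac{\|m\|_K}{2}+\tfrac{1}{\|m\|_K}\log\tfrac{1-p}{p}\big)$. The function names are ours and purely illustrative.

```python
import math

def std_normal_cdf(x):
    """Cumulative distribution function of the standard Gaussian distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def bayes_error(norm_m, p=0.5):
    """Bayes error L* given ||m||_K = norm_m > 0 and prior p = P(Y = 1)."""
    log_ratio = math.log((1.0 - p) / p)
    # P(eta* > 0 | Y = 0): standardize with mean -||m||^2/2 - log_ratio, sd ||m||
    err0 = std_normal_cdf(-norm_m / 2.0 - log_ratio / norm_m)
    # P(eta* < 0 | Y = 1): standardize with mean +||m||^2/2 - log_ratio, sd ||m||
    err1 = std_normal_cdf(-norm_m / 2.0 + log_ratio / norm_m)
    return (1.0 - p) * err0 + p * err1

print(bayes_error(2.0))        # equal priors, ||m||_K = 2
print(bayes_error(10.0, 0.3))  # large norm: the error becomes negligible
```

For p = 1/2 the logarithmic term vanishes and the expression reduces to $1-\Phi(\|m\|_K/2)$; as $\|m\|_K$ grows the error tends to zero, which is the mechanism behind near perfect classification.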

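Proof of Theorem 4. Observe that, if $\theta_j>0$ for all $j\ge1$,

$$m_1=\sum_{j=1}^\infty\mu_j\phi_j=\sum_{j=1}^\infty\frac{\mu_j}{\sqrt{\theta_j}}\,\sqrt{\theta_j}\,\phi_j,$$

where $\{\sqrt{\theta_j}\phi_j : \theta_j>0\}$ is an orthonormal basis of $\mathcal{H}(K)$ [see, e.g., Theorem 4.12, p.
61 in Cucker and Zhou (2007)]. Then, by Parseval's formula, $m_1\in\mathcal{H}(K)$ if and only if
$\|m_1\|_K^2=\sum_{j=1}^\infty\theta_j^{-1}\mu_j^2<\infty$. As a consequence, we have the desired equivalence:

$$P_1\sim P_0 \iff m_1\in\mathcal{H}(K) \iff \|m_1\|_K^2<\infty \iff \sum_{j=1}^\infty\theta_j^{-1}\mu_j^2<\infty.$$

Moreover,

$$\mathrm{err}_0 = 1-\Phi\Big(\frac12\Big(\sum_{j=1}^\infty\theta_j^{-1}\mu_j^2\Big)^{1/2}\Big) = 1-\Phi\Big(\frac{\|m_1\|_K}{2}\Big),$$

which gives the coordinate-free expression of the Bayes error.

Now, if we further assume (as in Delaigle and Hall (2012a)) that $\psi\in L^2$, the optimal
classifier proposed by these authors (5) is equivalent to $T^0(X)=1$ if and only if

$$\langle m_1,\psi\rangle_{L^2}^2 - 2\langle m_1,\psi\rangle_{L^2}\langle X,\psi\rangle_{L^2} < 0. \tag{S1}$$

Since $m_1=\sum_{j=1}^\infty\mu_j\phi_j$, with $m_1\neq0$, and $\psi=\sum_{j=1}^\infty\theta_j^{-1}\mu_j\phi_j$, we have
$\langle m_1,\psi\rangle_{L^2}=\sum_{j=1}^\infty\theta_j^{-1}\mu_j^2=\|m_1\|_K^2\neq0$. Therefore, (S1) holds if and only if

$$\langle X,\psi\rangle_{L^2} - \frac{\|m_1\|_K^2}{2} > 0.$$

To end the proof it is enough to show $\langle X,m_1\rangle_K=\langle X,\psi\rangle_{L^2}$. The linearity of $\langle X,\cdot\rangle_K$ and the
fact that $\theta_j$ and $\phi_j$ are respectively eigenvalues and eigenfunctions of the integral operator
with kernel K imply

$$\langle X,m_1\rangle_K = \sum_{j=1}^\infty\theta_j^{-1}\mu_j\langle X,\theta_j\phi_j\rangle_K = \sum_{j=1}^\infty\theta_j^{-1}\mu_j\int_0^T\langle X,K(\cdot,u)\rangle_K\,\phi_j(u)\,du.$$

Now, from Equation (6.18) in Parzen (1961),

$$\int_0^T\langle X,K(\cdot,u)\rangle_K\,\phi_j(u)\,du = \int_0^T X(u)\phi_j(u)\,du = \langle X,\phi_j\rangle_{L^2}.$$

Finally, combining the two last displayed equations,

$$\langle X,m_1\rangle_K = \sum_{j=1}^\infty\theta_j^{-1}\mu_j\langle X,\phi_j\rangle_{L^2} = \Big\langle X,\sum_{j=1}^\infty\theta_j^{-1}\mu_j\phi_j\Big\rangle_{L^2} = \langle X,\psi\rangle_{L^2}.$$

□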


Proof of Theorem 5. Let X = fhe Karhunen-Loeve expansion of X, with the 

Zi uncorrelated. For a given trajectory x = Dehne x"' = This is a 

trajectory drawn from the process X" = X]r=i whose distribution under Pi is denoted 
by Pin (for i = 0,1, the covariance function is Kn{s, t) = and the mean 

function under Pin is 

n 

= ^IE(Zj)0i(t), 
i=l 


Note that, under Pq, IE(Zj) = 0, so that the mean function is 0. From Karhunen-Loeve 
Theorem (see Ash and Gardner (1975), p. 38) m„(f) —)■ m{t) for all t (in fact this results 
holds uniformly in f). 

Note also that rUn € TC{K). Again this follows from the fact that {\/9i(j)i : 6i > 0} is an 
orthonormal basis of ‘K{K) [see, e.g.. Theorem 4.12, p. 61 in Cucker and Zhou (2007)]. 

We now prove that we must necessarily have lim„ ||m„||;^ = cx). Indeed, if we had 
lim„ ||m„||^ < oo for some subsequence of {m„} (denoted again {m„}) we would have 
that such {rrin} would be a Cauchy sequence in 1K(X), since for q > p, \\mp — mq\\K < 
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~ ll^pllii:|- This, together with the pointwise convergence mn{t) —?• m{t) leads, from 


Moore-Aronszajn Theorem (see Berlinet and Thomas-Agnan (2004), p. 19) to m G TC(A'). 
But, from Parzen’s Theorem 1, this would entail Pi << Po, in contradiction with Pi T Pq- 
We thus conclude HmnllE- —t oo. 


Then, given e > 0, choose n such that 


(l-p)$ - 


rrin \\k 


'^n \\K 


log 


1 -p 
p 


I -IldZhlilA + log —P I I < e, 


\\K 


P 


(S2) 


Now, consider the problem X”' ~ Pin vs ~ P^n Note that X” ~ Pin if and only if 
X ~ Pi, for i = 0,1. Since G 1K(X„), we have Pon ~ Pin (using again Parzen’s Theorem 
1 ). 

Now, according to Theorem 2 (on the expression of the optimal rules in the absolutely continuous case under homoscedasticity), the optimal rule is $g_n(X) = I_{\{\eta_n(X) > 0\}}$, where

$$\eta_n(x) = \langle x, m_n\rangle_K - \frac{1}{2}\|m_n\|_K^2 - \log\frac{1-p}{p}, \tag{S3}$$

whose probability of error is exactly the expression on the left-hand side of (S2). So this probability can be made arbitrarily small. □
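The last step of the argument can be checked numerically. The following minimal Python sketch evaluates the left-hand side of (S2) for increasing values of $\|m_n\|_K$; the function names and the prior $p = 0.3$ are illustrative assumptions, not taken from the paper.

```python
import math


def std_normal_cdf(z):
    """Standard Gaussian CDF, written with erfc for accuracy in the lower tail."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


def bayes_error(norm_m, p):
    """Left-hand side of (S2) for RKHS norm ||m_n||_K = norm_m and prior p = P(Y = 1)."""
    shift = math.log((1.0 - p) / p) / norm_m
    return ((1.0 - p) * std_normal_cdf(-norm_m / 2.0 - shift)
            + p * std_normal_cdf(-norm_m / 2.0 + shift))


# Illustrative values only: the error shrinks as the norm grows.
norms = [1.0, 2.0, 4.0, 8.0, 16.0]
errors = [bayes_error(r, 0.3) for r in norms]
for r, err in zip(norms, errors):
    print(f"||m_n||_K = {r:5.1f}   error = {err:.3e}")
```

With equal priors the logarithmic term vanishes and the expression reduces to $\Phi(-\|m_n\|_K/2)$.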

Proof of Theorem 6. For the sake of conciseness, denote $\tau := (t_1,\ldots,t_d)$, a generic element of $\Theta$, $\hat\tau := (\hat t_1,\ldots,\hat t_d)$, and $\tau^* := (t_1^*,\ldots,t_d^*)$. We will also use the following notation: for $j = 0,1$,

- PrVKf^mT)'^ 


ilijir) : = 


mfK-^KnK-^rh, 


where $m_{j,\tau} := (m_j(t_1),\ldots,m_j(t_d))^\top$ and $\hat\mu_\tau = (\hat m_{0,\tau} + \hat m_{1,\tau})/2$. With this notation it is not difficult to show that $L^* = 1 - \Phi(\psi(\tau^*)^{1/2}/2)$, and


2 [ 2 2 


' 0 i ( r )^/2 


where $\Phi$ is the cumulative distribution function of the standard Gaussian distribution (to obtain these formulas we have used the arguments in Mardia et al. (1980), p. 321, for $L^*$, and Fan and Fan (2008), p. 2609, for $L_n$). Since $\Phi$ is continuous, the desired conclusion will readily follow if we prove $\hat\psi_j(\hat\tau) \to \psi(\tau^*)$ as $n \to \infty$, a.s., for $j = 0,1$.

Since $\mathbb{E}\|\epsilon_j\|_\infty < \infty$, for $j = 0,1$, Mourier's Strong Law of Large Numbers (SLLN) for random elements taking values in Banach spaces (see, e.g., Laha and Rohatgi (1979), p. 452) implies

$$\sup_{\tau\in\Theta} \|\hat m_\tau - m_\tau\| \to 0, \quad \text{as } n \to \infty, \text{ a.s.} \tag{S4}$$

Since $\mathbb{E}\|\epsilon_j\|_\infty^2 < \infty$ for $j = 0,1$, Mourier's SLLN also implies that the entries of $\hat K_\tau$ converge uniformly to those of $K_\tau$, that is, for $i,j = 1,\ldots,d$,

$$\sup_{\tau\in\Theta} |\hat K_\tau(i,j) - K_\tau(i,j)| \to 0, \quad \text{as } n \to \infty, \text{ a.s.} \tag{S5}$$

Observe that

$$\hat K_\tau^{-1} = \frac{\mathrm{adj}(\hat K_\tau)}{\det(\hat K_\tau)},$$

where $\mathrm{adj}(K)$ and $\det(K)$ denote the adjugate and the determinant of a matrix $K$, respectively. By (S5), the entries of $\mathrm{adj}(\hat K_\tau)$ converge uniformly to those of $\mathrm{adj}(K_\tau)$, and $\det(\hat K_\tau)$ converges uniformly to $\det(K_\tau)$. Moreover, $\inf_{\tau\in\Theta}\det(K_\tau) > 0$ because $\det(K_\tau)$ is continuous in $\tau$ and, by assumption, $\det(K_\tau) > 0$ for all $\tau \in \Theta$, where $\Theta$ is a compact set. As a consequence of all these observations,

$$\sup_{\tau\in\Theta} |\hat K_\tau^{-1}(i,j) - K_\tau^{-1}(i,j)| \to 0, \quad \text{as } n \to \infty, \text{ a.s.} \tag{S6}$$


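By (S4) and (S6), it also holds that

$$\sup_{\tau\in\Theta} \|\hat K_\tau^{-1}\hat m_\tau - K_\tau^{-1} m_\tau\| \to 0, \quad \text{as } n \to \infty, \text{ a.s.}$$

From this convergence, together with (S4), we deduce

$$\sup_{\tau\in\Theta} |\hat\psi(\tau) - \psi(\tau)| \to 0, \quad \text{as } n \to \infty, \text{ a.s.} \tag{S7}$$

and

$$\sup_{\tau\in\Theta} |\hat\psi_j(\tau) - \psi(\tau)| \to 0, \quad \text{as } n \to \infty, \text{ a.s.}, \quad j = 0,1. \tag{S8}$$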


Due to (S7), with probability one, given $\epsilon > 0$ there exists $N$ such that for $n \ge N$ it holds that $\hat\psi(\tau) - \epsilon < \psi(\tau) < \hat\psi(\tau) + \epsilon$, for all $\tau \in \Theta$. Taking the maximum in these inequalities we get $\hat\psi(\hat\tau) - \epsilon < \psi(\tau^*) < \hat\psi(\hat\tau) + \epsilon$. That is, we have

$$\hat\psi(\hat\tau) \to \psi(\tau^*), \quad \text{as } n \to \infty, \text{ a.s.} \tag{S9}$$

Finally, note that for $j = 0,1$,

$$|\hat\psi_j(\hat\tau) - \psi(\tau^*)| \le |\hat\psi_j(\hat\tau) - \psi(\hat\tau)| + |\psi(\hat\tau) - \hat\psi(\hat\tau)| + |\hat\psi(\hat\tau) - \psi(\tau^*)|.$$

Then, from (S7), (S8) and (S9) we get $\hat\psi_j(\hat\tau) \to \psi(\tau^*)$ as $n \to \infty$, a.s., for $j = 0,1$, as desired. □
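To make the quantities in this proof concrete, the following Python sketch computes a quadratic form $\psi(\tau) = m_\tau^\top K_\tau^{-1} m_\tau$ and the error $1 - \Phi(\psi^{1/2}/2)$. It is only an illustration under stated assumptions: $\psi$ is taken to be this quadratic form in the difference of the mean functions, the covariance is that of the standard Brownian Motion, $K(s,t) = \min(s,t)$, the mean difference is $m(t) = t$ (as in a linear-trend model), and the priors are equal. All function names are hypothetical.

```python
import math


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def psi(tau, mean_diff, cov):
    """Quadratic form m_tau^T K_tau^{-1} m_tau at the points tau."""
    m = [mean_diff(t) for t in tau]
    k = [[cov(s, t) for t in tau] for s in tau]
    w = solve(k, m)  # w = K_tau^{-1} m_tau
    return sum(mi * wi for mi, wi in zip(m, w))


def error_equal_priors(psi_value):
    """1 - Phi(sqrt(psi)/2), computed as Phi(-sqrt(psi)/2) through erfc."""
    return 0.5 * math.erfc(math.sqrt(psi_value) / (2.0 * math.sqrt(2.0)))


# Assumed setting: Brownian covariance and mean difference m(t) = t.
psi_one_point = psi((1.0,), lambda t: t, min)
psi_two_points = psi((0.5, 1.0), lambda t: t, min)
error = error_equal_priors(psi_one_point)
```

In this assumed setting, adding the point $t = 0.5$ to $t = 1$ does not increase $\psi$, which agrees with the fact that only the last variable is relevant for a linear trend added to Brownian Motion.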


S2 Models used in the simulation study 


The general structure is similar to that of the simulation studies in Berrendero et al. (2016a) and Berrendero et al. (2016b), which are devoted to the assessment of variable selection methods in the functional classification setting. Here we consider the 94 models for which the mean functions $m_0$ and $m_1$ are different. The optimal classification rule in each case depends only on a finite number of variables. Models differ in complexity and number of relevant variables. They are defined by giving either:


(E1) A pair of distributions for $X|Y = 0$ and $X|Y = 1$ (corresponding to $P_0$ and $P_1$, respectively) as well as the prior probability $p = \mathbb{P}(Y = 1)$; in all cases, we take $p = \mathbb{P}(Y = 1) = 1/2$.

(E2) The marginal distribution of $X$ plus the conditional distribution $\eta(x) = \mathbb{P}(Y = 1|X = x)$.


All the 94 considered models belong to one of the following classes: 


Gaussian models: they are denoted by G. Gaussian models are generated according to the general pattern (E1). In all cases the distributions of $X(t)|Y = i$ are chosen among the Gaussian distributions described below.


Logistic models: they are defined through the general pattern (E2). The process $X = X(t)$ follows one of the above-mentioned distributions and $Y \sim \mathrm{Binom}(1, \eta(X))$ with $\eta(x) = (1 + e^{-\psi(x(t_1),\ldots,x(t_d))})^{-1}$, a function of the relevant variables $x(t_1),\ldots,x(t_d)$. The 15 versions and the few variants of this model considered are identified with the general label L. They correspond to different choices for the link function $\psi$ (both linear and nonlinear) and for the distribution of $X$.
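The label-generating mechanism of pattern (E2) can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the study: the trajectory is a deterministic toy curve, the linear link on $X_{30}$ and $X_{70}$ is one hypothetical choice of $\psi$, and all function names are assumptions.

```python
import math
import random


def eta(psi_value):
    """eta = (1 + exp(-psi))^{-1}, arranged to avoid overflow for large |psi|."""
    if psi_value >= 0:
        return 1.0 / (1.0 + math.exp(-psi_value))
    z = math.exp(psi_value)
    return z / (1.0 + z)


def draw_label(x, psi, rng):
    """Draw Y ~ Binom(1, eta(x)) for a discretized trajectory x."""
    return 1 if rng.random() < eta(psi(x)) else 0


def psi_linear(x):
    """Hypothetical linear link on X_30 and X_70 (1-based indices in the text)."""
    return 10.0 * x[29] + 10.0 * x[69]


# Toy trajectory x(t_i) = t_i - 0.5 on 100 equispaced points (illustration only).
x = [(i + 1) / 100 - 0.5 for i in range(100)]
rng = random.Random(0)
labels = [draw_label(x, psi_linear, rng) for _ in range(1000)]
```

For this toy trajectory the link is zero up to rounding, so each label is drawn with probability close to 1/2.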

Mixtures: they are obtained by combining (via mixtures) the above mentioned Gaussian 
distributions assumed for X\Y = 0 and X\Y = 1 in several ways. These models are denoted 
by M in the output tables. 

The processes involved are chosen among the following: first, the standard Brownian Motion, $B$. Second, $BT$ denotes a Brownian Motion with a trend $m(t)$, i.e., $BT(t) = B(t) + m(t)$; we have considered several choices for $m(t)$: a linear trend, $m(t) = ct$; a linear trend with random slope, i.e., $m(t) = \theta t$, where $\theta$ is a Gaussian r.v.; and different members of two parametric families, the peak functions $\Phi_{m,k}$ and the hillside functions, defined by

$$\Phi_{m,k}(t) = \int_0^t \varphi_{m,k}(s)\,ds, \qquad \mathrm{hillside}_{t_0,b}(t) = b\,(t - t_0)\,I_{[t_0,\infty)}(t),$$

where

$$\varphi_{m,k}(t) = \sqrt{2^{m-1}}\left[I_{\left(\frac{2k-2}{2^m},\frac{2k-1}{2^m}\right)}(t) - I_{\left(\frac{2k-1}{2^m},\frac{2k}{2^m}\right)}(t)\right], \quad \text{for } m \in \mathbb{N},\ 1 \le k \le 2^{m-1}.$$

Third, the Brownian Bridge: $BB(t) = B(t) - tB(1)$. Our fourth class of Gaussian processes is the Ornstein-Uhlenbeck process, with zero mean ($OU$) or different mean functions $m(t)$ ($OUt$). Finally, some “smooth” processes have also been included. They are obtained by convolving Brownian trajectories with Gaussian kernels. We have considered two levels of smoothing, denoted by sB and ssB; in the list of models below those labeled ssB are smoother than those with label sB.
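Some of these processes are easy to simulate on a grid. The sketch below, in plain Python, draws a Brownian Motion, a Brownian Bridge and a Brownian Motion with a hillside trend on 100 equispaced points; the grid $t_i = i/100$, the seed and the function names are assumptions made for illustration, and this is not the MATLAB code used for the study.

```python
import math
import random


def grid(n_points=100):
    """Equispaced points t_i = i / n_points, i = 1, ..., n_points (assumed grid)."""
    return [(i + 1) / n_points for i in range(n_points)]


def brownian_motion(times, rng):
    """Standard Brownian Motion at increasing times, from independent Gaussian increments."""
    path, value, previous = [], 0.0, 0.0
    for t in times:
        value += rng.gauss(0.0, math.sqrt(t - previous))
        path.append(value)
        previous = t
    return path


def brownian_bridge(times, rng):
    """BB(t) = B(t) - t B(1); assumes the last grid point is t = 1."""
    b = brownian_motion(times, rng)
    return [bt - t * b[-1] for t, bt in zip(times, b)]


def hillside(t, t0, b):
    """hillside_{t0,b}(t) = b (t - t0) for t >= t0, and 0 otherwise."""
    return b * (t - t0) if t >= t0 else 0.0


def brownian_with_trend(times, trend, rng):
    """BT(t) = B(t) + m(t) for a deterministic trend function m."""
    return [bt + trend(t) for t, bt in zip(times, brownian_motion(times, rng))]


rng = random.Random(0)
times = grid()
bb = brownian_bridge(times, rng)
bt = brownian_with_trend(times, lambda s: hillside(s, 0.5, 4.0), rng)
```

The simulated bridge is exactly zero at $t = 1$, which is the kind of degeneracy that motivates dropping the endpoints in Brownian Bridge type models.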

In the following list of models, $P_i$ denotes the distribution of $X|Y = i$ and variables is the set of relevant variables in each Gaussian or Mixture case. We call them “relevant” in the sense that the optimal classification rule depends only on these variables. In the list below the variables written in boldface are “especially relevant” in terms of their relative discriminating capacity.

All considered sample data are discretized in 100 equispaced points $X_1,\ldots,X_{100}$ in the interval $[0,1]$. To avoid degeneracies we have excluded the point 0 and the point 1 in the Brownian Bridge type models.


1. Gaussian models considered:

1. G2: $P_0$: $B(t) + t$; $P_1$: $B(t)$; variables = $\{X_{100}\}$.

2. G2b: $P_0$: $B(t) + 3t$; $P_1$: $B(t)$; variables = $\{X_{100}\}$.

3. G4: $P_0$: $B(t) + \mathrm{hillside}_{0.5,4}(t)$; $P_1$: $B(t)$; variables = $\{X_{47}, X_{100}\}$.

4. G5: $P_0$: $B(t) + 3\Phi_{1,1}(t)$; $P_1$: $B(t)$; variables = $\{X_1, X_{48}, X_{100}\}$.

5. G6: $P_0$: $B(t) + 5\Phi_{2,2}(t)$; $P_1$: $B(t)$; variables = $\{X_{48}, X_{75}, X_{100}\}$.

6. G7: $P_0$: $B(t) + 5\Phi_{3,2}(t) + 5\Phi_{3,4}(t)$; $P_1$: $B(t)$; variables = $\{X_{22}, X_{35}, X_{49}, X_{74}, X_{88}, X_{100}\}$.

7. G8: $P_0$: $B(t) + 3\Phi_{2,1.25}(t) + 3\Phi_{2,2}(t)$; $P_1$: $B(t)$; variables = $\{X_9, X_{35}, X_{48}, X_{62}, X_{75}, X_{100}\}$.
2. Logistic-type models under study: they are all defined according to method (E2) (see Sec. 6.1 in the main paper). The process $X = X(t)$ follows one of the distributions mentioned above and $Y \sim \mathrm{Binom}(1, \eta(X))$ with $\eta(x) = (1 + e^{-\psi(x(t_1),\ldots,x(t_k))})^{-1}$, a function of the relevant variables $x(t_1),\ldots,x(t_k)$.

L1: $\psi(X) = 10X_{65}$.

L2: $\psi(X) = 10X_{30} + 10X_{70}$.

L3: $\psi(X) = 10X_{30} - 10X_{70}$.

L4: $\psi(X) = 20X_{30} + 50X_{50} + 20X_{80}$.

L5: $\psi(X) = 20X_{30} - 50X_{50} + 20X_{80}$.

L6: $\psi(X) = 10X_{10} + 30X_{40} + 10X_{72} + 10X_{80} + 20X_{95}$.
L 7 : lA(X) = 10X10*. 

L 8 : V'(X) = 20 X|o + 10 X|o + 50 X|o. 

L 9 : V'(X) = 10X10 -I- IOIX50I -I- 0 X|qX 35 . 

L10: $\psi(X) = 20X_{33} + 20|X_{63}|$.


Lll: i^(X) = ^ + |^. 


L12: $\psi(X) = \log X_{35} + \log X_{77}$.

L13: $\psi(X) = 40X_{20} + 30X_{28} + 20X_{52} + 10X_{57}$.

L14: $\psi(X) = 40X_{20} + 30X_{28} - 20X_{52} - 10X_{57}$.

L15: $\psi(X) = 40X_{20} - 30X_{28} + 20X_{52} - 10X_{57}$.

Some variations of these models have also been considered:

L3b: $\psi(X) = 30X_{30} - 20X_{70}$.

L4b: $\psi(X) = 30X_{30} + 20X_{50} + 10X_{80}$.

L5b: $\psi(X) = 10X_{30} - 10X_{50} + 10X_{80}$.

L6b: $\psi(X) = 20X_{10} + 20X_{40} + 20X_{72} + 20X_{80} + 20X_{95}$.


L8b: = 10X|o + 10X|o + lOXfo- 


3. Mixture-type models: they are obtained by combining (via mixtures) in several ways the above-mentioned Gaussian distributions assumed for $X|Y = 0$ and $X|Y = 1$. These models are denoted M1, ..., M10 in the output tables.


1. M2 : 


Pq ■■ 


.Pi ■■ 


5(1) + 31-2.2(1), 

1/2 

fPo: 

5. M6 : < 

r 5(1)+ 31-2.1(1) 

,1/2 

5(1) + 51-3.2(1), 

1/2 


,1/2 

5(1) 


[Pl : 




variables = {X22,X35t ^48i^75i -^lool- 


variables = {A’i,X225 ^49)-X^iOo}- 


r r B(i) +31-2.2(1), 1/10 

: J ° ■ I B(i) +51-3.2(1), 9/10 

[Pi ■■ B{t) 

variables = {X22,X'35, X48,X75, Xioo}- 


1 

fPn 

r 5(1)+ 31-1,1(1) 

,1/2 

M7 : i 

|Po. 

' 1 BB{t) 

,1/2 

1 

[Pl : 

: 5(1) 


variables = 

{Xi,X48,Xioo}. 



\Po. 

3. M4: / 


R(t)+ 31-2.2(1), 1/2 
s(l) +51-3.3(1), 1/2 


Po: 

7. MS : < 


B(l) + 6»1, 6l~Af(0,5) ,1/2 

B{t) + hiUsideQ,^^^{t) ,1/2 


Pi : P(l) 


Pi : B(t) 


variables = {X48,Xg2,.X'75,2^loo}• 


variables = {X'47,Xioo}. 



r 5(1)+ 31-2,1(1) 

,1/3 



( 5(1)+ 31-1,1(1) 

,1/3 

Po : 

< 5(1) + 31-2.2(1), 

1/3 


Po : 

1 B(t) - 31 

,1/3 


( 5(1) + 51-3.2(1), 

1/3 

8. MIO : ■ 


1 BB{t) 

,1/3 

Pl ; 

B(t) 



.Pl : 

5(1) 



variables = {Xi,X22,X'35, X48,X75, Xioo}- 


variables = {Xi,X48,Xioo}. 


Finally, we consider here those models for which the mean functions $m_0$ and $m_1$ are different (otherwise any linear method is unable to discriminate between $P_0$ and $P_1$). The full list of models involved is as follows:


 1. L1 OU      2. L1 OUt     3. L1 B       4. L1 sB
 5. L1 ssB     6. L2 OU      7. L2 OUt     8. L2 B
 9. L2 sB     10. L2 ssB    11. L3 OU     12. L3b OU
13. L3 OUt    14. L3b OUt   15. L3 B      16. L3b B
17. L3 sB     18. L3 ssB    19. L4 OU     20. L4b OU
21. L4 OUt    22. L4b OUt   23. L4 B      24. L4 sB
25. L4 ssB    26. L5 OU     27. L5b OU    28. L5 OUt
29. L5 B      30. L5 sB     31. L5 ssB    32. L6 OU
33. L6b OU    34. L6 OUt    35. L6b OUt   36. L6 B
37. L6 sB     38. L6 ssB    39. L7 OU     40. L7b OU
41. L7 OUt    42. L7b OUt   43. L7 B      44. L7 sB
45. L7 ssB    46. L8 B      47. L8 sB     48. L8 ssB
49. L8b OU    50. L9 B      51. L9 sB     52. L9 ssB
53. L10 OU    54. L10 B     55. L10 sB    56. L10 ssB
57. L11 OU    58. L11 OUt   59. L11 B     60. L11 sB
61. L11 ssB   62. L12 OU    63. L12 OUt   64. L12 B
65. L12 sB    66. L12 ssB   67. L13 OU    68. L13 OUt
69. L13 B     70. L13 sB    71. L13 ssB   72. L14 OU
73. L14 OUt   74. L14 B     75. L14 sB    76. L15 OU
77. L15 OUt   78. L15 B     79. L15 sB    80. G2
81. G2b       82. G4        83. G5        84. G6
85. G7        86. G8        87. M2        88. M3
89. M4        90. M5        91. M6        92. M7
93. M8        94. M10


S3 Computational details 

All considered methodologies have been implemented in MATLAB. The code is available upon 
request. Some details: 


• We have followed the implementation of the minimum Redundancy Maximum Relevance algorithm given in Berrendero et al. (2016b). This version allows us to introduce different association measures.

• We have implemented the original iterative PLS algorithm that can be found, e.g., in Delaigle and Hall (2012b).

• Maxima-hunting and the distance correlation measure have been computed as described in Berrendero et al. (2016a).

• Our k-NN implementation is built around the MATLAB function pdist2 and allows for the use of different distances; we have employed the usual Euclidean distance. Also, the predictions for different numbers of neighbours can be computed simultaneously at no additional cost.

• Our LDA is a faster implementation of the MATLAB function classify.

• The linear SVM has been fitted with the MATLAB version of the LIBLINEAR library (see Fan et al. (2008)) using the parameters bias and solver type 2. With our data, it obtains results very similar to those of the default solver type 1, but faster. LIBLINEAR is much faster than the more popular LIBSVM library when using linear kernels.

• The cost parameter C of the linear SVM classifier, the number k of nearest neighbours in the k-NN rule, the smoothing parameter h in MHR and the number of selected variables are chosen by standard validation procedures explained in Section 6.
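The remark that several numbers of neighbours can be handled at once follows from the fact that distances need to be sorted only one time. The following Python sketch illustrates the idea for a single query point; it is not the MATLAB implementation described above, and the function name, the tie rule (ties go to class 0) and the toy data are assumptions.

```python
import math


def knn_predict_all_k(train_x, train_y, query, ks):
    """Majority-vote k-NN predictions for every k in ks, using one sort of the distances."""
    order = sorted(range(len(train_x)),
                   key=lambda i: math.dist(train_x[i], query))  # Euclidean distance
    wanted = set(ks)
    predictions = {}
    ones = 0  # running count of class-1 labels among the nearest neighbours seen so far
    for rank, i in enumerate(order, start=1):
        ones += train_y[i]
        if rank in wanted:
            predictions[rank] = 1 if 2 * ones > rank else 0
    return predictions


# Toy one-dimensional data (illustration only).
train_x = [(0.0,), (1.0,), (2.0,), (10.0,), (11.0,)]
train_y = [0, 0, 1, 1, 1]
predictions = knn_predict_all_k(train_x, train_y, (0.5,), [1, 3, 5])
```

The running vote count means that validating over a grid of values of k costs essentially the same as evaluating the largest k alone.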


S4 Additional results 

In this section we include some supplementary outputs and graphs of practical interest as well as 
more detailed information about the simulation results: 

• Some trajectories of the toy example in Section 5.5 are displayed in Figure S1. The left (right) panel shows trajectories from $P_0$ ($P_1$), and thick solid lines represent empirical means.

• Figure S2 displays the computational cost (in seconds) for different sample sizes n in that example. Each point represents the sum of computation times of 100 experiments for each methodology and sample size with d = 200. The results have been obtained on a standard PC with an Intel i7-3820 processor, 3.60 GHz, and 32GB RAM. Note that the considered kNN and SVM implementations are computationally efficient (see Section S3).

• Table S1 complements Table 1 by showing the average number of selected variables (or components).


• Tables S2, S3 and S4 show the classification accuracy (percentage of correct classification) for different groups of models and methods obtained with the LDA, kNN and SVM classifiers, respectively. Results from the different classifiers are quite similar in relative terms. Let us recall that the full results of the 1128 experiments (94 models × 4 sample sizes × 3 classifiers) are available in the supplementary file outputs. The methods appear in columns; apart from the methods in Table 1, we have included Base (except for LDA) and Oracle versions of each method. Base uses the entire trajectories, whereas Oracle only uses the true relevant variables. The simulation outputs are grouped in different categories (in rows) by model type and sample size n. The rows are labelled by the general model type, that is, logistic, Gaussian and mixtures. The logistic models are also divided by the type of processes involved, according to the notation given above. RKHS denotes the models that fulfil the hypotheses of RK-VS (G2, G2b, G4,...,G8) and “All models” includes the outputs of all the 94 considered models for each n. We have followed the methodology described in the main paper and the outputs are averaged over 200 independent runs. The marked values correspond to the best performance in each row (excluding Oracle, which is not feasible in practice).

Figure S1: Some trajectories from the toy example $B(t)$ (left) vs $B(t) + \Phi_{1,1}(t) - \Phi_{2,1}(t) + \Phi_{2,2}(t) - \Phi_{3,2}(t)$ (right). Thick solid lines correspond to the mean functions.

Figure S2: Time cost (in seconds) of 100 runs of the experiment for each method and different sample sizes with d = 200.
Table S1: Average number of selected variables (or components) with the three considered classifiers. Recall that the original dimension is 100.


                         Dimension reduction methods
Classifier  Sample size  mRMR-RD  PLS   MHR   RK-VS  RKb-VS
LDA         n = 30       4.9      2.6   5.4   2.7    3.7
            n = 50       5.9      2.8   6.1   2.8    4.1
            n = 100      7.2      3.3   7.0   3.2    4.8
            n = 200      8.1      4.0   7.5   3.9    5.6
kNN         n = 30       7.8      4.3   6.2   7.6    8.1
            n = 50       8.0      4.8   6.2   7.3    7.9
            n = 100      8.4      5.5   6.2   6.7    7.6
            n = 200      8.6      6.2   5.9   6.3    7.2
SVM         n = 30       9.3      3.3   8.0   9.3    10.0
            n = 50       9.4      3.8   7.9   8.7    9.6
            n = 100      9.7      4.6   7.9   8.0    --
            n = 200      --       5.6   7.5   7.6    8.9
Table S2: Percentage of correct classification with LDA


Models        n    mRMR-RD  PLS    MHR    RK-VS  RKb-VS  Base   LDA-Oracle
All models    30   81.04    82.87  82.44  81.50  80.89   61.48  84.97
              50   82.37    83.78  83.68  83.44  82.54   59.30  86.23
              100  83.79    84.70  84.97  85.30  84.46   53.31  87.18
              200  84.88    85.46  85.90  86.51  85.90   74.73  87.69
Logistic OU   30   78.70    80.11  79.36  78.21  76.47   60.32  81.92
              50   80.12    80.96  80.75  80.23  78.33   58.05  83.24
              100  81.70    81.90  82.30  82.16  80.69   52.83  84.27
              200  83.05    82.74  83.65  83.66  82.61   71.79  84.84
Logistic OUt  30   80.12    81.30  80.87  79.60  78.56   61.10  83.11
              50   81.21    82.05  81.98  81.42  80.20   58.80  84.44
              100  82.39    82.91  83.14  83.14  82.15   53.04  85.45
              200  83.35    83.51  84.03  84.29  83.66   73.54  85.93
Logistic B    30   82.79    84.57  84.19  83.52  82.32   62.74  87.54
              50   84.18    85.55  85.59  85.65  84.21   60.06  88.83
              100  85.74    86.60  87.16  87.71  86.47   53.55  89.90
              200  86.88    87.50  88.33  89.17  88.18   75.94  90.51
Logistic sB   30   82.95    84.63  84.26  83.43  82.37   62.87  87.10
              50   84.18    85.59  85.59  85.39  84.11   60.74  88.46
              100  85.51    86.60  87.02  87.52  86.34   53.17  89.55
              200  86.71    87.38  88.20  88.84  87.98   75.73  90.18
Logistic ssB  30   84.56    85.73  85.58  84.93  84.51   63.60  86.54
              50   85.65    86.49  86.54  86.42  85.93   60.68  87.90
              100  86.86    87.25  87.38  87.89  87.39   53.55  88.81
              200  87.83    88.01  87.72  88.83  88.59   75.33  89.38
Gaussian      30   85.28    88.63  88.70  88.30  89.95   62.56  90.91
              50   86.72    89.45  89.38  89.81  90.69   61.24  91.41
              100  88.21    89.91  89.86  90.81  91.18   55.43  91.64
              200  89.00    90.38  89.96  91.13  91.30   83.89  91.71
Mixture       30   71.95    76.19  75.40  73.93  76.65   55.51  79.09
              50   73.88    77.66  77.03  76.63  78.30   55.13  80.29
              100  75.54    78.91  78.61  79.13  79.89   52.48  81.07
              200  76.46    79.66  79.29  80.21  80.61   70.77  81.39
RKHS          30   85.28    88.63  88.70  88.30  89.95   62.56  90.91
              50   86.72    89.45  89.38  89.81  90.69   61.24  91.41
              100  88.21    89.91  89.86  90.81  91.18   55.43  91.64
              200  89.00    90.38  89.96  91.13  91.30   83.89  91.71
Table S3: Percentage of correct classification with kNN


Models        n    mRMR-RD  PLS    MHR    RK-VS  RKb-VS  Base   kNN-Oracle
All models    30   81.88    82.45  82.46  82.28  81.92   79.61  84.56
              50   82.95    83.49  83.43  83.75  83.25   80.96  86.16
              100  84.31    84.77  84.73  85.59  84.95   82.60  87.94
              200  85.38    85.79  85.91  87.16  86.50   83.99  89.25
Logistic OU   30   78.71    79.22  79.20  78.58  77.82   75.63  81.15
              50   79.64    80.04  80.02  79.98  79.05   76.87  82.63
              100  80.96    81.13  81.26  81.66  80.68   78.44  84.30
              200  82.10    82.07  82.56  83.21  82.23   79.73  85.49
Logistic OUt  30   81.87    82.71  82.30  81.91  81.37   79.50  84.46
              50   82.83    83.52  83.18  83.13  82.49   80.62  85.89
              100  84.12    84.52  84.33  84.90  84.03   82.02  87.35
              200  85.00    85.31  85.30  86.23  85.31   83.14  88.49
Logistic B    30   83.29    84.01  83.94  83.94  83.04   81.10  86.61
              50   84.38    85.08  84.90  85.47  84.55   82.35  88.24
              100  85.68    86.30  86.31  87.40  86.41   83.92  90.19
              200  86.78    87.39  87.63  89.27  88.25   85.35  91.66
Logistic sB   30   84.00    84.48  84.55  84.40  83.66   81.90  86.59
              50   84.87    85.36  85.31  85.65  84.93   83.02  88.24
              100  86.09    86.61  86.62  87.51  86.62   84.44  90.11
              200  87.07    87.58  87.84  89.17  88.35   85.73  91.59
Logistic ssB  30   85.92    85.97  86.35  86.39  86.09   84.47  88.01
              50   86.86    86.78  87.11  87.49  87.10   85.41  89.44
              100  87.93    87.86  88.05  88.89  88.55   86.71  91.04
              200  88.89    88.81  88.75  90.24  89.88   87.91  92.34
Gaussian      30   83.96    85.35  85.79  86.16  87.13   83.20  87.46
              50   84.80    86.61  86.68  87.62  88.20   84.99  88.55
              100  85.69    87.85  87.58  88.91  89.19   86.61  89.56
              200  86.30    88.74  88.19  89.68  89.84   87.94  90.11
Mixture       30   74.20    74.40  74.40  74.42  75.92   71.05  76.83
              50   76.59    76.92  76.70  77.45  78.43   73.92  79.58
              100  79.46    79.68  79.20  80.76  81.36   77.32  82.70
              200  81.48    81.51  81.42  83.21  83.61   79.98  84.74
RKHS          30   83.96    85.35  85.79  86.16  87.13   83.20  87.46
              50   84.80    86.61  86.68  87.62  88.20   84.99  88.55
              100  85.69    87.85  87.58  88.91  89.19   86.61  89.56
              200  86.30    88.74  88.19  89.68  89.84   87.94  90.11
Table S4: Percentage of correct classification with SVM


Models        n    mRMR-RD  PLS    MHR    RK-VS  RKb-VS  Base   SVM-Oracle
All models    30   83.22    84.12  84.62  84.28  84.12   83.86  87.53
              50   84.21    85.04  85.44  85.60  85.20   85.01  88.21
              100  85.27    86.03  86.29  86.96  86.48   86.20  88.75
              200  86.10    86.79  86.86  87.90  87.50   87.07  89.03
Logistic OU   30   79.98    80.79  80.81  80.19  79.65   80.18  83.93
              50   81.13    81.64  81.69  81.66  80.95   81.36  84.62
              100  82.39    82.51  82.59  83.15  82.44   82.50  85.17
              200  83.51    83.30  83.50  84.32  83.74   83.42  85.49
Logistic OUt  30   83.38    83.84  84.33  83.70  83.28   83.77  87.24
              50   84.37    84.69  85.14  85.00  84.39   84.82  87.88
              100  85.43    85.67  86.07  86.34  85.75   85.94  88.37
              200  86.15    86.34  86.71  87.26  86.74   86.71  88.64
Logistic B    30   85.24    85.81  87.01  86.56  85.97   86.01  90.58
              50   86.23    86.83  87.92  88.11  87.20   87.17  91.23
              100  87.35    87.92  88.99  89.58  88.69   88.50  91.80
              200  88.16    88.85  89.85  90.71  89.95   89.50  92.09
Logistic sB   30   85.55    85.98  87.06  86.68  86.22   86.22  90.22
              50   86.33    86.96  87.92  87.86  87.32   87.32  90.96
              100  87.13    88.01  88.88  89.41  88.69   88.51  91.53
              200  88.04    88.84  89.55  90.41  89.80   89.40  91.81
Logistic ssB  30   87.16    87.31  87.69  --     88.25   87.65  90.08
              50   87.93    88.02  88.28  --     88.90   88.47  90.57
              100  88.82    88.96  88.55  --     89.77   89.37  91.00
              200  89.47    89.73  88.54  --     90.57   90.16  91.25
Gaussian      30   86.42    88.72  88.97  89.00  89.99   87.29  90.54
              50   87.33    89.44  89.27  89.94  90.49   88.81  91.02
              100  88.48    90.03  89.60  90.63  90.93   89.88  91.38
              200  88.98    90.41  89.51  91.03  91.21   90.48  91.45
Mixture       30   73.01    76.52  76.12  75.53  76.93   74.88  78.71
              50   74.39    77.90  77.42  77.50  78.35   76.51  79.89
              100  75.55    79.27  78.65  79.41  79.72   78.20  80.76
              200  76.35    80.10  --     80.26  80.50   79.21  81.16
RKHS          30   86.42    88.72  88.97  89.00  89.99   87.29  90.54
              50   87.33    89.44  89.27  89.94  90.49   88.81  91.02
              100  88.48    90.03  89.60  90.63  90.93   89.88  91.38
              200  88.98    90.41  89.51  91.03  91.21   90.48  91.45